Abstract. Within the framework of statistical learning theory we analyze in detail
the so-called elastic-net regularization scheme proposed by Zou and Hastie [15] for the
selection of groups of correlated variables. To investigate the statistical properties
of this scheme, and in particular its consistency properties, we set up a suitable
mathematical framework. Our setting is random-design regression where we allow the
response variable to be vector-valued and we consider prediction functions which are
linear combinations of elements (features) in an infinite-dimensional dictionary. Under the
assumption that the regression function admits a sparse representation on the dictionary,
we prove that there exists a particular "elastic-net representation" of the regression
function such that, as the number of data increases, the elastic-net estimator is consistent
not only for prediction but also for variable/feature selection. Our results include
finite-sample bounds and an adaptive scheme to select the regularization parameter. Moreover,
using convex analysis tools, we derive an iterative thresholding algorithm for computing
the elastic-net solution which is different from the optimization procedure originally
proposed in [15].

1. Introduction

We consider the standard framework of supervised learning, that is, nonparametric
regression with random design. In this setting, there is an input-output pair
$(X, Y) \in \mathcal{X} \times \mathcal{Y}$ with unknown probability distribution $P$, and the goal is to find a
prediction function $f_n : \mathcal{X} \to \mathcal{Y}$, based on a training set $(X_1, Y_1), \dots, (X_n, Y_n)$ of $n$
independent random pairs distributed as $(X, Y)$. A good solution $f_n$ is such that, given a
new input $x \in \mathcal{X}$, the value $f_n(x)$ is a good prediction of the true output $y \in \mathcal{Y}$. When
choosing the square loss to measure the quality of the prediction, as we do throughout this
paper, this means that the expected risk $\mathbb{E}[|Y - f_n(X)|^2]$ is small, or, in other words, that
$f_n$ is a good approximation of the regression function $f^*(x) = \mathbb{E}[Y \mid X = x]$ minimizing this
risk.

In many learning problems, a major goal besides prediction is that of selecting the
variables that are relevant to achieve good predictions. In the problem of variable selection
we are given a set $(\psi_\gamma)_{\gamma \in \Gamma}$ of functions from the input space $\mathcal{X}$ into the output space
$\mathcal{Y}$ and we aim at selecting those functions which are needed to represent the regression
function, where the representation is typically given by a linear combination. The set
$(\psi_\gamma)_{\gamma \in \Gamma}$ is usually called the dictionary and its elements features. We can think of the
features as measurements used to represent the input data, as providing some relevant
parameterization of the input space, or as a (possibly overcomplete) dictionary of functions
used to represent the prediction function. In modern applications, the number $p$ of features
in the dictionary is usually very large, possibly much larger than the number $n$ of examples
in the training set. This situation is often referred to as the "large p, small n paradigm"
[9], and a key to obtaining a meaningful solution in such a case is the requirement that the
prediction function $f_n$ is a linear combination of only a few elements in the dictionary, i.e.
that $f_n$ admits a sparse representation.
The above setting can be illustrated by two examples of applications we are currently
working on and which provide an underlying motivation for the theoretical framework 
developed in the present paper. The first application is a classification problem in com- 
puter vision, namely face detection [T71 [191 [18]. The training set contains images of faces 
and non-faces and each image is represented by a very large redundant set of features 
capturing the local geometry of faces, for example wavelet-like dictionaries or other local 
descriptors. The aim is to find a good predictor able to detect faces in new images. 

The second application is the analysis of microarray data, where the features are the 
expression level measurements of the genes in a given sample or patient, and the output is 
either a classification label discriminating between two or more pathologies or a continuous 
index indicating, for example, the gravity of an illness. In this problem, besides prediction 
of the output for examples-to-come, another important goal is the identification of the 
features that are the most relevant to build the estimator and would constitute a gene 
signature for a certain disease p^ H] . In both applications, the number of features we 
have to deal with is much larger than the number of examples and assuming sparsity of 
the solution is a very natural requirement. 

The problem of variable/feature selection has a long history in statistics and it is known 
that the brute-force approach (trying all possible subsets of features), though theoretically
appealing, is computationally infeasible. A first strategy to overcome this problem is
provided by greedy algorithms. A second route, which we follow in this paper, makes 
use of sparsity-based regularization schemes (convex relaxation methods). The best-known
example of such schemes is probably the so-called Lasso regression [38] - also
referred to in the signal processing literature as Basis Pursuit Denoising [13] - where a
coefficient vector $\beta_n$ is estimated as the minimizer of the empirical risk penalized with the
$\ell_1$-norm, namely

    $\beta_n = \operatorname{argmin}_{\beta = (\beta_\gamma)_{\gamma \in \Gamma}} \left( \frac{1}{n} \sum_{i=1}^{n} |Y_i - f_\beta(X_i)|^2 + \lambda \sum_{\gamma \in \Gamma} |\beta_\gamma| \right)$,

where $f_\beta = \sum_{\gamma \in \Gamma} \beta_\gamma \psi_\gamma$, $\lambda$ is a suitable positive regularization parameter and
$(\psi_\gamma)_{\gamma \in \Gamma}$ a given set of features. An extension of this approach, called bridge regression,
amounts to replacing the $\ell_1$-penalty by an $\ell_p$-penalty [23]. It has been shown that this kind of
penalty can still achieve sparsity when p is bigger, but very close to 1 (see [i^B]). For 
this class of techniques, both consistency and computational aspects have been studied. 
Non-asymptotic bounds within the framework of statistical learning have been studied in 
several papers [251 S [SHI 133) EH IMl [HI [26] . A common feature of these results is that they 
assume that the dictionary is finite (with cardinality possibly depending on the number of 
examples) and satisfies some assumptions about the linear independence of the relevant 
features - see [26] for a discussion on this point - whereas $\mathcal{Y}$ is usually assumed to be $\mathbb{R}$.
Several numerical algorithms have also been proposed to solve the optimization problem 
underlying Lasso regression and are based e.g. on quadratic programming [13], on the 
so-called LARS algorithm P2j or on iterative soft-thresholding (see [2] and references 
therein) . 

Despite its success in many applications, the Lasso strategy has some drawbacks
in variable selection problems where there are highly correlated features and we need to
identify all the relevant ones. This situation is of utmost importance for e.g. microarray
data analysis since, as is well known, there is a lot of functional dependency between genes
which are organized in small interacting networks. The identification of such groups of 
correlated genes involved in a specific pathology is desirable to make progress in the 
understanding of the underlying biological mechanisms. 
Motivated by microarray data analysis, Zou and Hastie [15] proposed the use of a
penalty which is a weighted sum of the $\ell_1$-norm and the square of the $\ell_2$-norm of the
coefficient vector $\beta$. The first term enforces the sparsity of the solution, whereas the second
term ensures democracy among groups of correlated variables. In [15] the corresponding
method is called (naive) elastic net. The method makes it possible to select groups of
correlated features when the groups are not known in advance (algorithms to enforce group-sparsity
with preassigned groups of variables have been proposed in e.g. [211 1121 [22] using other
types of penalties).

In the present paper we study several properties of the elastic-net regularization scheme 
for vector-valued regression in a random design. In particular, we prove consistency under 
some adaptive and non-adaptive choices for the regularization parameter. As concerns 
variable selection, we assess the accuracy of our estimator for the vector $\beta$ with respect
to the $\ell_2$-norm, whereas the prediction ability of the corresponding function $f_n = f_{\beta_n}$
is measured by the expected risk $\mathbb{E}[|Y - f_n(X)|^2]$. To derive such error bounds, we
characterize the solution of the variational problem underlying elastic-net regularization
as the fixed point of a contractive map and, as a byproduct, we derive an explicit iterative
thresholding procedure to compute the estimator. As explained below, when the features are
highly collinear, the $\ell_2$-penalty, besides enforcing grouped
selection, is crucial to ensure stability with respect to random sampling.

In the remainder of this section, we define the main ingredients for elastic-net regular- 
ization within our general framework, discuss the underlying motivations for the method 
and then outline the main results established in the paper. 

As an extension of the setting originally proposed in [15], we allow the dictionary to
have an infinite number of features. In such a case, to cope with infinite sums, we need
some assumptions on the coefficients. We assume that the prediction function we have to
determine is a linear combination of the features $(\psi_\gamma)_{\gamma \in \Gamma}$ in the dictionary and that the
series

    $f_\beta(x) = \sum_{\gamma \in \Gamma} \beta_\gamma \psi_\gamma(x)$

converges absolutely for all $x \in \mathcal{X}$ and for all sequences $\beta = (\beta_\gamma)_{\gamma \in \Gamma}$ satisfying
$\sum_{\gamma \in \Gamma} u_\gamma \beta_\gamma^2 < \infty$, where $u_\gamma$ are given positive weights. The latter constraint can be viewed
as a constraint on the regularity of the functions $f_\beta$ we use to approximate the regression function.
For infinite-dimensional sets, as for example wavelet bases or splines, suitable choices of
the weights correspond to the assumption that $f_\beta$ is in a Sobolev space (see Section 2
for more details about this point). Such requirement of regularity is common when deal- 
ing with infinite-dimensional spaces of functions, as it happens in approximation theory, 
signal analysis and inverse problems. 

To ensure the convergence of the series defining $f_\beta$, we assume that

(1)    $\sum_{\gamma \in \Gamma} \frac{|\psi_\gamma(x)|^2}{u_\gamma}$ is finite for all $x \in \mathcal{X}$.

Notice that for finite dictionaries, the series becomes a finite sum and the previous
condition as well as the introduction of weights become superfluous.

To simplify the notation and the formulation of our results, and without any loss of
generality, we will in the following rescale the features by defining $\varphi_\gamma = \psi_\gamma / \sqrt{u_\gamma}$, so that
on this rescaled dictionary, $f_\beta = \sum_{\gamma \in \Gamma} \tilde\beta_\gamma \varphi_\gamma$ will be represented by means of a vector
$\tilde\beta_\gamma = \sqrt{u_\gamma}\, \beta_\gamma$ belonging to $\ell_2$; the condition (1) then becomes $\sum_{\gamma \in \Gamma} |\varphi_\gamma(x)|^2 < +\infty$, for
all $x \in \mathcal{X}$. From now on, we will only use this rescaled representation and we drop the
tilde on the vector $\beta$.
Let us now define our estimator as the minimizer of the empirical risk penalized with
a (weighted) elastic-net penalty, that is, a combination of the squared $\ell_2$-norm and of a
weighted $\ell_1$-norm of the vector $\beta$. More precisely, we define the elastic-net penalty as
follows.

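Definition 1. Given a family $(w_\gamma)_{\gamma \in \Gamma}$ of weights $w_\gamma \ge 0$ and a parameter $\varepsilon \ge 0$, let
$p_\varepsilon : \ell_2 \to [0, \infty]$ be defined as

(2)    $p_\varepsilon(\beta) = \sum_{\gamma \in \Gamma} \left( w_\gamma |\beta_\gamma| + \varepsilon \beta_\gamma^2 \right)$,

which can also be rewritten as $p_\varepsilon(\beta) = \|\beta\|_{1,w} + \varepsilon \|\beta\|_2^2$, where $\|\beta\|_{1,w} = \sum_{\gamma \in \Gamma} w_\gamma |\beta_\gamma|$.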

The weights $w_\gamma$ allow us to enforce more or less sparsity on different groups of features.
We assume that they are prescribed in a given problem, so that we do not need to explicitly
indicate the dependence of $p_\varepsilon(\beta)$ on these weights. The elastic-net estimator is defined
by the following minimization problem.

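Definition 2. Given $\lambda > 0$, let $\mathcal{E}_n^\lambda : \ell_2 \to [0, +\infty]$ be the empirical risk penalized by the
penalty $p_\varepsilon(\beta)$,

(3)    $\mathcal{E}_n^\lambda(\beta) = \frac{1}{n} \sum_{i=1}^{n} |Y_i - f_\beta(X_i)|^2 + \lambda\, p_\varepsilon(\beta)$,

and let $\beta_n^\lambda \in \ell_2$ be the or a minimizer of (3) on $\ell_2$,

(4)    $\beta_n^\lambda = \operatorname{argmin}_{\beta \in \ell_2} \mathcal{E}_n^\lambda(\beta)$.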

The positive parameter $\lambda$ is a regularization parameter controlling the trade-off between
the empirical error and the penalty. Clearly, $\beta_n^\lambda$ also depends on the parameter $\varepsilon$, but we
do not write this dependence explicitly since $\varepsilon$ will always be fixed.

Setting $\varepsilon = 0$ in (3), we obtain as a special case an infinite-dimensional extension of
the Lasso regression scheme. On the other hand, setting $w_\gamma = 0$ for all $\gamma$, the method reduces
to $\ell_2$-regularized least-squares regression - also referred to as ridge regression - with a
generalized linear model. The $\ell_1$-penalty has selection capabilities since it enforces sparsity
of the solution, whereas the $\ell_2$-penalty induces a linear shrinkage on the coefficients leading
to stable solutions. The positive parameter $\varepsilon$ controls the trade-off between the $\ell_1$-penalty
and the $\ell_2$-penalty.

We will show that, if $\varepsilon > 0$, the minimizer $\beta_n^\lambda$ always exists and is unique. In the paper
we will focus on the case $\varepsilon > 0$. Some of our results, however, still hold for $\varepsilon = 0$, possibly
under some supplementary conditions, as will be indicated in due time.
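
For a finite dictionary with scalar outputs, Definitions 1 and 2 reduce to a few lines of arithmetic. The following is a minimal sketch under those assumptions only; the function names and the nested-list layout `Phi[i][g]` (the g-th rescaled feature evaluated at the i-th input) are hypothetical and not taken from the paper.

```python
def elastic_net_penalty(beta, w, eps):
    """Penalty (2): sum_g w_g * |beta_g| + eps * sum_g beta_g^2."""
    l1_part = sum(wg * abs(bg) for wg, bg in zip(w, beta))
    l2_part = sum(bg * bg for bg in beta)
    return l1_part + eps * l2_part


def penalized_empirical_risk(beta, Phi, Y, lam, w, eps):
    """Functional (3) for scalar outputs and a finite dictionary.

    Phi[i][g] is assumed to hold the g-th feature evaluated at X_i,
    so that f_beta(X_i) = sum_g Phi[i][g] * beta[g].
    """
    n = len(Y)
    fitted = [sum(row[g] * beta[g] for g in range(len(beta))) for row in Phi]
    risk = sum((y - f) ** 2 for y, f in zip(Y, fitted)) / n
    return risk + lam * elastic_net_penalty(beta, w, eps)
```

With `eps = 0` the penalty is the weighted Lasso term alone, and with all weights equal to zero it is the ridge term alone, mirroring the two special cases discussed above.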

As previously mentioned, one of the main advantages of the elastic-net penalty is that
it achieves stability with respect to random sampling. To illustrate this property
more clearly, we consider a toy example where the (rescaled) dictionary has only
two elements $\varphi_1$ and $\varphi_2$ with weights $w_1 = w_2 = 1$. The effect of random sampling is
particularly dramatic in the presence of highly correlated features. To illustrate this
situation, we assume that $\varphi_1$ and $\varphi_2$ exhibit a special kind of linear dependency, namely that
they are linearly dependent on the input data $X_1, \dots, X_n$: $\varphi_2(X_i) = \tan\theta_n\, \varphi_1(X_i)$ for all
$i = 1, \dots, n$, where we have parametrized the coefficient of proportionality by means of
the angle $\theta_n \in [0, \pi/2]$. Notice that this angle is a random variable since it depends on
the input data.

Observe that the minimizers of (3) must lie at a tangency point between a level set of
the empirical error and a level set of the elastic-net penalty. The level sets of the empirical
error are all parallel straight lines with slope $-\cot\theta_n$, as depicted by a dashed line in the
two panels of Figure 2, whereas the level sets of the elastic-net penalty are elastic-net
balls ($\varepsilon$-balls) with center at the origin and corners at the intersections with the axes, as
depicted in Figure 1. When $\varepsilon = 0$, i.e. with a pure $\ell_1$-penalty (Lasso), the $\varepsilon$-ball is simply
a square (dashed line in Figure 1) and we see that the unique tangency point will be the
top corner if $\theta_n > \pi/4$ (the point T in the two panels of Figure 2), or the right corner if
$\theta_n < \pi/4$. For $\theta_n = \pi/4$ (that is, when $\varphi_1$ and $\varphi_2$ coincide on the data), the minimizer
of (3) is no longer unique since the level sets will touch along an edge of the square. Now,
if $\theta_n$ randomly tilts around $\pi/4$ (because of the random sampling of the input data), we
see that the Lasso estimator is not stable since it randomly jumps between the top and
the right corner. If $\varepsilon \to \infty$, i.e. with a pure $\ell_2$-penalty (ridge regression), the $\varepsilon$-ball
becomes a disc (dotted line in Figure 1) and the minimizer is the point of the straight
line having minimal distance from the origin (the point Q in the two panels of Figure 2).
The solution always exists, is stable under random perturbations, but it is never sparse
(if $0 < \theta_n < \pi/2$).

Figure 1. The $\varepsilon$-ball with $\varepsilon > 0$ (solid line), the square ($\ell_1$-ball), which is the
$\varepsilon$-ball with $\varepsilon = 0$ (dashed line), and the disc ($\ell_2$-ball), which is the $\varepsilon$-ball with
$\varepsilon \to \infty$ (dotted line).

The situation changes if we consider the elastic-net estimator with $\varepsilon > 0$ (the
corresponding minimizer is the point P in the two panels of Figure 2). The presence of the
$\ell_2$-term ensures a smooth and stable behavior when the Lasso estimator becomes
unstable. More precisely, let $-\cot\theta_+$ be the slope of the right tangent at the top corner of the
elastic-net ball ($\theta_+ > \pi/4$), and $-\cot\theta_-$ the slope of the upper tangent at the right corner
($\theta_- < \pi/4$). As depicted in the top panel of Figure 2, the minimizer will be the top corner
if $\theta_n > \theta_+$. It will be the right corner if $\theta_n < \theta_-$. In both cases the elastic-net solution
is sparse. On the other hand, if $\theta_- < \theta_n < \theta_+$ the minimizer has both components $\beta_1$
and $\beta_2$ different from zero - see the bottom panel of Figure 2; in particular, $\beta_1 = \beta_2$ if
$\theta_n = \pi/4$. Now we observe that if $\theta_n$ randomly tilts around $\pi/4$, the solution smoothly
moves between the top corner and the right corner. However, the price we paid to get
such stability is a decrease in sparsity, since the solution is sparse only when $\theta_n \notin [\theta_-, \theta_+]$.
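
The jump of the Lasso solution and the smooth behavior of the elastic net can be observed numerically on a tiny instance of the toy example. The sketch below is an illustration under assumptions that are not in the paper: a single sample with $\varphi_1(X_1) = 1$ and $Y_1 = 1$, weights $w_1 = w_2 = 1$, and a brute-force grid search over $[0, 1]^2$ instead of an actual solver. The function name `toy_minimizer` is hypothetical.

```python
import math


def toy_minimizer(theta, lam, eps, steps=200):
    """Grid search for the minimizer of the toy objective

        (1 - b1 - tan(theta) * b2)^2 + lam * (|b1| + |b2| + eps * (b1^2 + b2^2))

    over the grid {0, 1/steps, ..., 1}^2 (an assumed search region).
    """
    t = math.tan(theta)
    grid = [k / steps for k in range(steps + 1)]
    best, best_val = None, float("inf")
    for b1 in grid:
        for b2 in grid:
            risk = (1.0 - b1 - t * b2) ** 2
            penalty = b1 + b2 + eps * (b1 * b1 + b2 * b2)  # b1, b2 >= 0 on the grid
            val = risk + lam * penalty
            if val < best_val:
                best, best_val = (b1, b2), val
    return best


above = math.atan(1.1)   # angle slightly above pi/4
below = math.atan(0.9)   # angle slightly below pi/4

lasso_above = toy_minimizer(above, lam=0.5, eps=0.0)
lasso_below = toy_minimizer(below, lam=0.5, eps=0.0)
enet_above = toy_minimizer(above, lam=0.5, eps=1.0)
enet_below = toy_minimizer(below, lam=0.5, eps=1.0)
```

In this instance the Lasso minimizer switches from one axis to the other as the angle crosses $\pi/4$, whereas the elastic-net minimizer keeps both components non-zero on either side.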
Figure 2. Estimators in the two-dimensional example: T=Lasso, P=elastic
net and Q=ridge regression. Top panel: $\theta_+ < \theta < \pi/2$. Bottom panel: $\pi/4 < \theta < \theta_+$.

The previous elementary example could be refined in various ways to show the essential
role played by the $\ell_2$-penalty to overcome the instability effects inherent to the use of the
$\ell_1$-penalty for variable selection in a random-design setting.

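We now conclude this introductory section by a summary of the main results which
will be derived in the core of the paper. A key result will be to show that for $\varepsilon > 0$, $\beta_n^\lambda$
is the fixed point of the following contractive map

    $\beta \mapsto \frac{1}{\tau + \varepsilon\lambda}\, S_\lambda\big( (\tau I - \Phi_n^* \Phi_n)\beta + \Phi_n^* Y \big)$,

where $\tau$ is a suitable relaxation constant, $\Phi_n^* \Phi_n$ is the matrix with entries
$(\Phi_n^* \Phi_n)_{\gamma,\gamma'} = \frac{1}{n} \sum_{i=1}^{n} \langle \varphi_\gamma(X_i), \varphi_{\gamma'}(X_i) \rangle$, and $\Phi_n^* Y$ is the vector
$(\Phi_n^* Y)_\gamma = \frac{1}{n} \sum_{i=1}^{n} \langle \varphi_\gamma(X_i), Y_i \rangle$ ($\langle \cdot, \cdot \rangle$ denotes
the scalar product in the output space $\mathcal{Y}$). Moreover, $S_\lambda(\beta)$ is the soft-thresholding
operator acting componentwise as follows

    $[S_\lambda(\beta)]_\gamma = \begin{cases} \beta_\gamma - \frac{\lambda w_\gamma}{2} & \text{if } \beta_\gamma > \frac{\lambda w_\gamma}{2}, \\ 0 & \text{if } |\beta_\gamma| \le \frac{\lambda w_\gamma}{2}, \\ \beta_\gamma + \frac{\lambda w_\gamma}{2} & \text{if } \beta_\gamma < -\frac{\lambda w_\gamma}{2}. \end{cases}$

As a consequence of the Banach fixed point theorem, $\beta_n^\lambda$ can be computed by means of an
iterative algorithm. This procedure is completely different from the modification of the
LARS algorithm used in [15] and is akin instead to the algorithm developed in [14].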
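
The fixed-point characterization suggests a simple iteration: repeatedly apply a map of the form "gradient-type step, soft-thresholding with threshold $\lambda w_\gamma / 2$, division by $\tau + \varepsilon\lambda$". The sketch below illustrates this for a finite dictionary with scalar outputs. It is not the authors' implementation: the function names are hypothetical, and the relaxation constant is an assumption (half the trace of the empirical Gram matrix, which keeps the map contractive when $\varepsilon\lambda > 0$).

```python
def soft_threshold(v, thresh):
    """Scalar soft-thresholding: shrink v towards zero by thresh."""
    if v > thresh:
        return v - thresh
    if v < -thresh:
        return v + thresh
    return 0.0


def elastic_net_iterative_thresholding(Phi, Y, lam, w, eps, n_iter=500):
    """Fixed-point iteration for the elastic-net minimizer (requires eps * lam > 0).

    Phi[i][g] is assumed to hold the g-th feature at the i-th input (scalar outputs).
    """
    n, p = len(Phi), len(Phi[0])
    # G plays the role of Phi_n^* Phi_n and b the role of Phi_n^* Y.
    G = [[sum(Phi[i][g] * Phi[i][h] for i in range(n)) / n for h in range(p)]
         for g in range(p)]
    b = [sum(Phi[i][g] * Y[i] for i in range(n)) / n for g in range(p)]
    # Assumed relaxation constant: eigenvalues of G lie in [0, trace(G)],
    # so the norm of (tau * I - G) is at most tau for tau = trace(G) / 2.
    tau = 0.5 * sum(G[g][g] for g in range(p))
    beta = [0.0] * p
    for _ in range(n_iter):
        h = [tau * beta[g] - sum(G[g][k] * beta[k] for k in range(p)) + b[g]
             for g in range(p)]
        beta = [soft_threshold(h[g], 0.5 * lam * w[g]) / (tau + eps * lam)
                for g in range(p)]
    return beta
```

For two identical features the iteration returns equal coefficients, in line with the toy example where $\beta_1 = \beta_2$ at $\theta_n = \pi/4$. A fixed iteration count is used for brevity; a stopping rule on successive iterates would be a natural refinement.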
Another interesting property which we will derive from the above equation is that the
non-zero components of $\beta_n^\lambda$ are such that $w_\gamma \le \frac{C}{\lambda}$, where $C$ is a constant depending on
the data. Hence the only active features are those for which the corresponding weight lies
below the threshold $C/\lambda$. If the features are organized into finite subsets of increasing
complexity (as it happens for example for wavelets) and the weights tend to infinity with 
increasing feature complexity, then the number of active features is finite and can be 
determined for any given data set. Let us recall that in the case of ridge regression, the 
so-called representer theorem, see [H], ensures that we only have to solve in practice a 
finite-dimensional optimization problem, even when the dictionary is infinite-dimensional 
(as in kernel methods). This is no longer true, however, with an $\ell_1$-type regularization
and, for practical purposes, one would need to truncate infinite dictionaries. A standard 
way to do this is to consider only a finite subset of m features, with m possibly depending 
on n - see for example [8, J4j. Notice that such procedure implicitly assumes some order 
in the features and makes sense only if the retained features are the most relevant ones. 
For example, in [3], it is assumed that there is a natural exhaustion of the hypothesis space 
with nested subspaces spanned by finite-dimensional subsets of features of increasing size. 
In our approach we adopt a different strategy, namely the encoding of such information 
in the elastic-net penalty by means of suitable weights in the $\ell_1$-norm.

The main result of our paper concerns the consistency for variable selection of $\beta_n^\lambda$. We
prove that, if the regularization parameter $\lambda = \lambda_n$ satisfies the conditions $\lim_{n \to \infty} \lambda_n = 0$
and $\lim_{n \to \infty} (\lambda_n \sqrt{n} - 2 \log n) = +\infty$, then

    $\lim_{n \to \infty} \| \beta_n^{\lambda_n} - \beta^\varepsilon \|_2 = 0$ with probability one,

where the vector $\beta^\varepsilon$, which we call the elastic-net representation of $f^*$, is the minimizer
of

    $\left( \sum_{\gamma \in \Gamma} w_\gamma |\beta_\gamma| + \varepsilon \sum_{\gamma \in \Gamma} |\beta_\gamma|^2 \right)$ subject to $f_\beta = f^*$.

The vector $\beta^\varepsilon$ exists and is unique provided that $\varepsilon > 0$ and the regression function
$f^*$ admits a sparse representation on the dictionary, i.e. $f^* = \sum_{\gamma \in \Gamma} \beta^*_\gamma \varphi_\gamma$ for at least
one vector $\beta^* \in \ell_2$ such that $\sum_{\gamma \in \Gamma} w_\gamma |\beta^*_\gamma|$ is finite. Notice that, when the features are
linearly dependent, there is a problem of identifiability since there are many vectors $\beta$
such that $f^* = f_\beta$. The elastic-net regularization scheme forces $\beta_n^{\lambda_n}$ to converge to $\beta^\varepsilon$.
This is precisely what happens for linear inverse problems where the regularized solution
converges to the minimum-norm solution of the least-squares problem. As a consequence
of the above convergence result, one easily deduces the consistency of the corresponding
prediction function $f_n := f_{\beta_n^{\lambda_n}}$, that is, $\lim_{n \to \infty} \mathbb{E}[|f_n(X) - f^*(X)|^2] = 0$ with probability one.
When the regression function does not admit a sparse representation, we can still prove 
the previous consistency result for /„ provided that the linear span of the features is 
sufficiently rich. Finally, we use a data-driven choice for the regularization parameter, 
based on the so-called balancing principle |S], to obtain non-asymptotic bounds which are 
adaptive to the unknown regularity of the regression function. 
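
The two conditions on $\lambda_n$ pull in opposite directions: $\lambda_n$ must vanish, but slowly enough that $\lambda_n \sqrt{n}$ dominates $2 \log n$. The short sketch below evaluates the quantity $\lambda_n \sqrt{n} - 2 \log n$ for two candidate schedules. Both schedules and all names are illustrative assumptions, not choices made in the paper.

```python
import math


def schedule_gap(lam_n, n):
    """The quantity lam_n * sqrt(n) - 2 * log(n), which must diverge to +infinity."""
    return lam_n * math.sqrt(n) - 2.0 * math.log(n)


def slow_decay(n):
    """Hypothetical schedule lam_n = n^(-1/4): the gap behaves like n^(1/4) - 2 log n."""
    return n ** -0.25


def fast_decay(n):
    """Hypothetical schedule lam_n = log(n) / sqrt(n): the gap equals -log n."""
    return math.log(n) / math.sqrt(n)


for n in (10**4, 10**8, 10**12):
    print(n, schedule_gap(slow_decay(n), n), schedule_gap(fast_decay(n), n))
```

The first schedule satisfies both conditions, although the gap only becomes positive for fairly large $n$. The second one vanishes too quickly, since its gap tends to $-\infty$.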

The rest of the paper is organized as follows. In Section 2, we set up the mathematical 
framework of the problem. In Section 3, we analyze the optimization problem underlying 
elastic-net regularization and the iterative thresholding procedure we propose to compute 
the estimator. Finally, Section 4 contains the statistical analysis with our main results 
concerning the estimation of the errors on our estimators as well as their consistency prop- 
erties under appropriate a priori and adaptive strategies for choosing the regularization 
parameter. 
2. Mathematical setting of the problem

2.1. Notations and assumptions. In this section we describe the general setting of the
regression problem we want to solve and specify all the required assumptions.

We assume that $\mathcal{X}$ is a separable metric space and that $\mathcal{Y}$ is a (real) separable Hilbert
space, with norm and scalar product denoted respectively by $|\cdot|$ and $\langle \cdot, \cdot \rangle$. Typically, $\mathcal{X}$
is a subset of $\mathbb{R}^d$ and $\mathcal{Y}$ is $\mathbb{R}$. Recently, however, there has been an increasing interest in
vector- valued regression problems [221 E] and multiple supervised learning tasks [3U1 12] : in 
both settings $\mathcal{Y}$ is taken to be $\mathbb{R}^m$. Infinite-dimensional output spaces are also of interest,
as e.g. in the problem of estimating the glycemic response during a time interval depending
on the amount and type of food; in such a case, $\mathcal{Y}$ is the space $L^2$ or some Sobolev space.
Other examples of applications in an infinite-dimensional setting are given in [11] . 

Our first assumption concerns the set of features. 

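Assumption 1. The family of features $(\varphi_\gamma)_{\gamma \in \Gamma}$ is a countable set of measurable functions
$\varphi_\gamma : \mathcal{X} \to \mathcal{Y}$ such that

(5)    $\forall x \in \mathcal{X} \quad k(x) = \sum_{\gamma \in \Gamma} |\varphi_\gamma(x)|^2 \le \kappa$,

for some finite number $\kappa$.

The index set $\Gamma$ is countable, but we do not assume any order. As for the convergence
of series, we use the notion of summability: given a family $(v_\gamma)_{\gamma \in \Gamma}$ of vectors in a normed
vector space $V$, $v = \sum_{\gamma \in \Gamma} v_\gamma$ means that $(v_\gamma)_{\gamma \in \Gamma}$ is summable$^1$ with sum $v \in V$.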

Assumption 1 can be seen as a condition on the class of functions that can be recovered
by the elastic-net scheme. As already noted in the Introduction, we have at our disposal
an arbitrary (countable) dictionary $(\psi_\gamma)_{\gamma \in \Gamma}$ of measurable functions, and we try to
approximate $f^*$ with linear combinations $f_\beta(x) = \sum_{\gamma \in \Gamma} \beta_\gamma \psi_\gamma(x)$ where the set of coefficients
$(\beta_\gamma)_{\gamma \in \Gamma}$ satisfies some decay condition equivalent to a regularity condition on the functions
$f_\beta$. We make this condition precise by assuming that there exists a sequence of positive
weights $(u_\gamma)_{\gamma \in \Gamma}$ such that $\sum_{\gamma \in \Gamma} u_\gamma \beta_\gamma^2 < \infty$ and, for any such vector $\beta = (\beta_\gamma)_{\gamma \in \Gamma}$, that
the series defining $f_\beta$ converges absolutely for all $x \in \mathcal{X}$. These two facts follow from
the requirement that the set of rescaled features $\varphi_\gamma = \psi_\gamma / \sqrt{u_\gamma}$ satisfies $\sum_{\gamma \in \Gamma} |\varphi_\gamma(x)|^2 < \infty$.
Condition (5) is a little bit stronger since it requires that $\sup_{x \in \mathcal{X}} \sum_{\gamma \in \Gamma} |\varphi_\gamma(x)|^2 < \infty$, so
that we also have that the functions $f_\beta$ are bounded. To simplify the notation, in the rest
of the paper, we only use the (rescaled) features $\varphi_\gamma$ and, with this choice, the regularity
condition on the coefficients $(\beta_\gamma)_{\gamma \in \Gamma}$ becomes $\sum_{\gamma \in \Gamma} \beta_\gamma^2 < \infty$.

An example of features satisfying the condition (5) is given by a family of rescaled
wavelets on $\mathcal{X} = [0, 1]$. Let $\{\psi_{jk} \mid j = 0, 1, \dots;\ k \in \Delta_j\}$ be an orthonormal wavelet basis
in $L^2([0, 1])$ with regularity $C^r$, $r > \frac{1}{2}$, where for $j \ge 1$ $\{\psi_{jk} \mid k \in \Delta_j\}$ is the orthonormal
wavelet basis (with suitable boundary conditions) spanning the detail space at level $j$. To
simplify notation, it is assumed that the set $\{\psi_{0k} \mid k \in \Delta_0\}$ contains both the wavelets
and the scaling functions at level $j = 0$. Fix $s$ such that $\frac{1}{2} < s < r$ and let $\varphi_{jk} = 2^{-js} \psi_{jk}$.



Footnote 1. That is, for all $\eta > 0$, there is a finite subset $\Gamma_0 \subset \Gamma$ such that
$\| v - \sum_{\gamma \in \Gamma'} v_\gamma \|_V < \eta$ for all finite subsets $\Gamma' \supset \Gamma_0$. If $\Gamma = \mathbb{N}$, the notion of
summability is equivalent to requiring the series to converge
unconditionally (i.e. its terms can be permuted without affecting convergence). If the vector space
is finite-dimensional, summability is equivalent to absolute convergence, but in the infinite-dimensional
setting, there are summable series which are not absolutely convergent.
Then

    $\sum_{j=0}^{\infty} \sum_{k \in \Delta_j} |\varphi_{jk}(x)|^2 = \sum_{j=0}^{\infty} \sum_{k \in \Delta_j} 2^{-2js} |\psi_{jk}(x)|^2 \le C \sum_{j=0}^{\infty} 2^{-2js}\, 2^{j} = \kappa$,

where $C$ is a suitable constant depending on the number of wavelets that are non-zero at
a point $x \in [0, 1]$ for a given level $j$, and on the maximum values of the scaling function
and of the mother wavelet; see [1] for a similar setting. The geometric series on the right
converges because $s > \frac{1}{2}$.

Condition (5) allows us to define the hypothesis space in which we search for the estimator.
Let $\ell_2$ be the Hilbert space of the families $(\beta_\gamma)_{\gamma \in \Gamma}$ of real numbers such that
$\sum_{\gamma \in \Gamma} \beta_\gamma^2 < \infty$, with the usual scalar product $\langle \cdot, \cdot \rangle_2$ and the corresponding norm
$\| \cdot \|_2$. We will denote by $(e_\gamma)_{\gamma \in \Gamma}$ the canonical basis of $\ell_2$ and by
$\mathrm{supp}(\beta) = \{\gamma \in \Gamma \mid \beta_\gamma \ne 0\}$ the support of $\beta$. The
Cauchy-Schwarz inequality and the condition (5) ensure that, for any $\beta = (\beta_\gamma)_{\gamma \in \Gamma} \in \ell_2$,
the series

    $\sum_{\gamma \in \Gamma} \beta_\gamma \varphi_\gamma(x) = f_\beta(x)$

is summable in $\mathcal{Y}$ uniformly on $\mathcal{X}$ with

(6)    $\sup_{x \in \mathcal{X}} |f_\beta(x)| \le \kappa^{1/2}\, \|\beta\|_2$.

Later on, in Proposition 3, we will show that the hypothesis space $\mathcal{H} = \{f_\beta \mid \beta \in \ell_2\}$ is
then a vector-valued reproducing kernel Hilbert space on $\mathcal{X}$ with a bounded kernel [12],
and that $(\varphi_\gamma)_{\gamma \in \Gamma}$ is a normalized tight frame for $\mathcal{H}$. In the example of the wavelet features
one can easily check that $\mathcal{H}$ is the Sobolev space $H^s$ on $[0, 1]$ and $\|\beta\|_2$ is equivalent to
$\|f_\beta\|_{H^s}$.
The second assumption concerns the regression model. 

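Assumption 2. The random couple $(X, Y)$ in $\mathcal{X} \times \mathcal{Y}$ obeys the regression model

    $Y = f^*(X) + W$

where

(7)    $f^* = f_{\beta^*}$ for some $\beta^* \in \ell_2$ with $\sum_{\gamma \in \Gamma} w_\gamma |\beta^*_\gamma| < +\infty$

and

(8)    $\mathbb{E}[W \mid X] = 0$

(9)    $\mathbb{E}\left[ \exp\left( \frac{|W|}{L} \right) - \frac{|W|}{L} - 1 \,\Big|\, X \right] \le \frac{\sigma^2}{2 L^2}$

with $\sigma, L > 0$. The family $(w_\gamma)_{\gamma \in \Gamma}$ are the positive weights defining the elastic-net penalty
$p_\varepsilon(\beta)$ in (2).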

Observe that $f^* = f_{\beta^*}$ is always a bounded function by (6). Moreover the condition (7)
is a further regularity condition on the regression function and will not be needed for some
of the results derived in the paper. Assumption (9) is satisfied by bounded, Gaussian or
sub-Gaussian noise. In particular, it implies

(10)    $\mathbb{E}[|W|^m \mid X] \le \frac{1}{2}\, m!\, \sigma^2 L^{m-2}, \qquad \forall m \ge 2$,

see [1^, so that W has a finite second moment. It follows that Y has a finite first moment 
and (8) implies that $f^*$ is the regression function $\mathbb{E}[Y \mid X = x]$.
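
The step from the exponential moment condition (9) to the moment bounds (10) can be made explicit. The following sketch assumes (9) has the Bernstein-type form $\mathbb{E}[\exp(|W|/L) - |W|/L - 1 \mid X] \le \sigma^2 / (2L^2)$ and uses only the power series of the exponential; the exchange of sum and conditional expectation is justified by the non-negativity of the terms.

```latex
% Sketch: (9) implies (10), using e^t - t - 1 = \sum_{m \ge 2} t^m / m! with t = |W| / L.
\begin{align*}
\mathbb{E}\Big[ \exp\Big(\tfrac{|W|}{L}\Big) - \tfrac{|W|}{L} - 1 \,\Big|\, X \Big]
  &= \sum_{m \ge 2} \frac{\mathbb{E}\big[ |W|^m \mid X \big]}{m!\, L^m}
   \;\le\; \frac{\sigma^2}{2 L^2}, \\
\text{hence, for each fixed } m \ge 2: \qquad
\mathbb{E}\big[ |W|^m \mid X \big]
  &\le m!\, L^m \cdot \frac{\sigma^2}{2 L^2}
   \;=\; \tfrac{1}{2}\, m!\, \sigma^2 L^{m-2}.
\end{align*}
```

Each term of the series is non-negative, so every single term is bounded by the whole sum, which gives the second line. Taking $m = 2$ yields $\mathbb{E}[|W|^2 \mid X] \le \sigma^2$.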
Condition (7) controls both the sparsity and the regularity of the regression function.
If $\inf_{\gamma \in \Gamma} w_\gamma = w_0 > 0$, it is sufficient to require that $\|\beta^*\|_{1,w}$ is finite. Indeed, the Hölder
inequality gives that

(11)    $\|\beta\|_2 \le \frac{1}{w_0} \|\beta\|_{1,w}$.

If $w_0 = 0$, we also need $\|\beta^*\|_2$ to be finite. In the example of the (rescaled) wavelet
features a natural choice for the weights is $w_{jk} = 2^{ja}$ for some $a \in \mathbb{R}$, so that $\|\beta\|_{1,w}$ is
equivalent to the norm $\|f_\beta\|_{B^{\tilde s}_{1,1}}$, with $\tilde s = a + s + \frac{1}{2}$, in the Besov space $B^{\tilde s}_{1,1}$ on $[0, 1]$ (for
more details, see e.g. the appendix in [Ej). In such a case, (7) is equivalent to requiring
that $f^* \in H^s \cap B^{\tilde s}_{1,1}$.

Finally, our third assumption concerns the training sample. 

Assumption 3. The random pairs $(X_n, Y_n)_{n \ge 1}$ are independent and identically
distributed (i.i.d.) according to the distribution of $(X, Y)$.

In the following, we let $P$ be the probability distribution of $(X, Y)$, and $L^2_{\mathcal{Y}}(P)$ be the
Hilbert space of (measurable) functions $f : \mathcal{X} \times \mathcal{Y} \to \mathcal{Y}$ with the norm

    $\|f\|_P^2 = \int_{\mathcal{X} \times \mathcal{Y}} |f(x, y)|^2 \, dP(x, y)$.

With a slight abuse of notation, we regard the random pair $(X, Y)$ as a function on $\mathcal{X} \times \mathcal{Y}$,
that is, $X(x, y) = x$ and $Y(x, y) = y$. Moreover, we denote by $P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i, Y_i}$ the
empirical distribution and by $L^2_{\mathcal{Y}}(P_n)$ the corresponding (finite-dimensional) Hilbert space
with norm

    $\|f\|_n^2 = \frac{1}{n} \sum_{i=1}^{n} |f(X_i, Y_i)|^2$.

2.2. Operators defined by the set of features. The choice of a quadratic loss function
and the Hilbert structure of the hypothesis space suggest using some tools from the
theory of linear operators. In particular, the function $f_\beta$ depends linearly on $\beta$ and can
be regarded as an element of both $L^2_{\mathcal{Y}}(P)$ and $L^2_{\mathcal{Y}}(P_n)$. Hence it defines two operators,
whose properties are summarized by the next two propositions, based on the following
lemma.

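Lemma 1. For any fixed $x \in \mathcal{X}$, the map $\Phi_x : \ell_2 \to \mathcal{Y}$ defined by

    $\Phi_x \beta = \sum_{\gamma \in \Gamma} \varphi_\gamma(x) \beta_\gamma = f_\beta(x)$

is a Hilbert-Schmidt operator, its adjoint $\Phi_x^* : \mathcal{Y} \to \ell_2$ acts as

(12)    $(\Phi_x^* y)_\gamma = \langle y, \varphi_\gamma(x) \rangle, \qquad \gamma \in \Gamma,\ y \in \mathcal{Y}$.

In particular $\Phi_x^* \Phi_x$ is a trace-class operator with

(13)    $\mathrm{Tr}(\Phi_x^* \Phi_x) \le \kappa$.

Moreover, $\Phi_X^* Y$ is an $\ell_2$-valued random variable with

(14)    $\|\Phi_X^* Y\|_2 \le \kappa^{1/2} |Y|$,

and $\Phi_X^* \Phi_X$ is an $\mathcal{L}_{HS}$-valued random variable with

(15)    $\|\Phi_X^* \Phi_X\|_{HS} \le \kappa$,

where $\mathcal{L}_{HS}$ denotes the separable Hilbert space of the Hilbert-Schmidt operators on $\ell_2$, and
$\|\cdot\|_{HS}$ is the Hilbert-Schmidt norm.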
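Proof. Clearly $\Phi_x$ is a linear map from $\ell_2$ to $\mathcal{Y}$. Since $\Phi_x e_\gamma = \varphi_\gamma(x)$, we have

    $\sum_{\gamma \in \Gamma} |\Phi_x e_\gamma|^2 = \sum_{\gamma \in \Gamma} |\varphi_\gamma(x)|^2 \le \kappa$,

so that $\Phi_x$ is a Hilbert-Schmidt operator and $\mathrm{Tr}(\Phi_x^* \Phi_x) \le \kappa$ by (5). Moreover, given
$y \in \mathcal{Y}$ and $\gamma \in \Gamma$

    $(\Phi_x^* y)_\gamma = \langle \Phi_x^* y, e_\gamma \rangle_2 = \langle y, \Phi_x e_\gamma \rangle = \langle y, \varphi_\gamma(x) \rangle$,

which is (12). Finally, since $\mathcal{X}$ and $\mathcal{Y}$ are separable, the map $(x, y) \mapsto \langle y, \varphi_\gamma(x) \rangle$ is
measurable, so $(\Phi_X^* Y)_\gamma$ is a real random variable and, since $\ell_2$ is separable, $\Phi_X^* Y$ is an
$\ell_2$-valued random variable with

    $\|\Phi_X^* Y\|_2^2 = \sum_{\gamma \in \Gamma} \langle Y, \varphi_\gamma(X) \rangle^2 \le \kappa\, |Y|^2$.

A similar proof holds for $\Phi_X^* \Phi_X$, recalling that any trace-class operator is in $\mathcal{L}_{HS}$ and
$\|\Phi_X^* \Phi_X\|_{HS} \le \mathrm{Tr}(\Phi_X^* \Phi_X)$. □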

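The following proposition defines the distribution-dependent operator $\Phi_P$ as a map
from $\ell_2$ into $L^2_{\mathcal{Y}}(P)$.

Proposition 1. The map $\Phi_P : \ell_2 \to L^2_{\mathcal{Y}}(P)$, defined by $\Phi_P \beta = f_\beta$, is a Hilbert-Schmidt
operator and

(16)    $\Phi_P^* Y = \mathbb{E}[\Phi_X^* Y]$

(17)    $\Phi_P^* \Phi_P = \mathbb{E}[\Phi_X^* \Phi_X]$

(18)    $\mathrm{Tr}(\Phi_P^* \Phi_P) = \mathbb{E}[k(X)] \le \kappa$.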

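Proof. Since $f_\beta$ is a bounded (measurable) function, $f_\beta \in L^2_{\mathcal{Y}}(P)$ and

    $\sum_{\gamma \in \Gamma} \|\Phi_P e_\gamma\|_P^2 = \sum_{\gamma \in \Gamma} \mathbb{E}[|\varphi_\gamma(X)|^2] = \mathbb{E}[k(X)] \le \kappa$.

Hence $\Phi_P$ is a Hilbert-Schmidt operator with $\mathrm{Tr}(\Phi_P^* \Phi_P) = \sum_{\gamma \in \Gamma} \|\Phi_P e_\gamma\|_P^2$, so that (18)
holds. By (10) $W$ has a finite second moment and by (6) $f^* = f_{\beta^*}$ is a bounded function,
hence $Y = f^*(X) + W$ is in $L^2_{\mathcal{Y}}(P)$. Now for any $\beta \in \ell_2$ we have

    $\langle \Phi_P^* Y, \beta \rangle_2 = \langle Y, \Phi_P \beta \rangle_P = \mathbb{E}[\langle Y, \Phi_X \beta \rangle] = \mathbb{E}[\langle \Phi_X^* Y, \beta \rangle_2]$.

On the other hand, by (14), $\Phi_X^* Y$ has finite expectation, so that (16) follows. Finally,
given $\beta, \beta' \in \ell_2$

    $\langle \Phi_P^* \Phi_P \beta', \beta \rangle_2 = \langle \Phi_P \beta', \Phi_P \beta \rangle_P = \mathbb{E}[\langle \Phi_X \beta', \Phi_X \beta \rangle] = \mathbb{E}[\langle \Phi_X^* \Phi_X \beta', \beta \rangle_2]$,

so that (17) is clear, since $\Phi_X^* \Phi_X$ has finite expectation as a consequence of the fact that
it is a bounded $\mathcal{L}_{HS}$-valued random variable. □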

Replacing P by the empirical measure we get the sample version of the operator. 
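Proposition 2. The map $\Phi_n : \ell_2 \to L^2_{\mathcal{Y}}(P_n)$ defined by $\Phi_n\beta = f_\beta$ is a Hilbert-Schmidt
operator and

(19) $\Phi_n^* Y = \frac{1}{n}\sum_{i=1}^n \Phi_{X_i}^* Y_i$

(20) $\Phi_n^*\Phi_n = \frac{1}{n}\sum_{i=1}^n \Phi_{X_i}^*\Phi_{X_i}$

(21) $\mathrm{Tr}(\Phi_n^*\Phi_n) = \frac{1}{n}\sum_{i=1}^n k(X_i) \le \kappa.$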

The proof of Proposition 2 is analogous to the proof of Proposition 1, except that $P$ is
to be replaced by $P_n$.

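By (12) with $y = \varphi_{\gamma'}(x)$, we have that the matrix elements of the operator $\Phi_x^*\Phi_x$ are
$(\Phi_x^*\Phi_x)_{\gamma\gamma'} = \langle \varphi_{\gamma'}(x), \varphi_\gamma(x)\rangle$, so that $\Phi_n^*\Phi_n$ is the empirical mean of the Gram matrix of the
set $(\varphi_\gamma)_{\gamma\in\Gamma}$, whereas $\Phi_P^*\Phi_P$ is the corresponding mean with respect to the distribution $P$.
Notice that if the features are linearly dependent in $L^2_{\mathcal{Y}}(P_n)$, the matrix $\Phi_n^*\Phi_n$ has a non-trivial
kernel and hence is not invertible. More importantly, if $\Gamma$ is countably infinite, $\Phi_n^*\Phi_n$
is a compact operator, so that its inverse (if it exists) is not bounded. On the contrary,
if $\Gamma$ is finite and $(\varphi_\gamma)_{\gamma\in\Gamma}$ are linearly independent in $L^2_{\mathcal{Y}}(P_n)$, then $\Phi_n^*\Phi_n$ is invertible. A
similar reasoning holds for the matrix $\Phi_P^*\Phi_P$. To control whether these matrices have a
bounded inverse or not, we introduce a lower spectral bound $\kappa_0 \ge 0$, such that

$$\kappa_0 \le \inf_{\beta\in\ell_2,\ \|\beta\|_2=1} \langle \Phi_P^*\Phi_P\beta, \beta\rangle_2$$

and, with probability 1,

$$\kappa_0 \le \inf_{\beta\in\ell_2,\ \|\beta\|_2=1} \langle \Phi_n^*\Phi_n\beta, \beta\rangle_2.$$

Clearly we can have $\kappa_0 > 0$ only if $\Gamma$ is finite and the features $(\varphi_\gamma)_{\gamma\in\Gamma}$ are linearly
independent both in $L^2_{\mathcal{Y}}(P_n)$ and $L^2_{\mathcal{Y}}(P)$.

On the other hand, (18) and (21) give the crude upper spectral bounds

$$\sup_{\beta\in\ell_2,\ \|\beta\|_2=1} \langle \Phi_P^*\Phi_P\beta, \beta\rangle_2 \le \kappa, \qquad \sup_{\beta\in\ell_2,\ \|\beta\|_2=1} \langle \Phi_n^*\Phi_n\beta, \beta\rangle_2 \le \kappa.$$

One could improve these estimates by means of a tight bound on the largest eigenvalue
of $\Phi_P^*\Phi_P$.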
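As an illustration of the identity (21), the following Python sketch builds the empirical Gram matrix $\Phi_n^*\Phi_n$ for a finite dictionary with scalar outputs and compares its trace with the empirical mean of $k(X_i)$. The function names and the toy data are assumptions made for this example only; they do not come from the paper.

```python
def empirical_gram(features):
    """Empirical Gram matrix (1/n) * sum_i phi(X_i) phi(X_i)^T.

    features[i][j] is the (scalar) value of the j-th feature at sample X_i.
    """
    n = len(features)
    p = len(features[0])
    return [[sum(row[j] * row[k] for row in features) / n for k in range(p)]
            for j in range(p)]


def trace(matrix):
    """Sum of the diagonal entries of a square matrix given as nested lists."""
    return sum(matrix[j][j] for j in range(len(matrix)))


# Hypothetical toy data: n = 4 samples, 2 features.
features = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
gram = empirical_gram(features)

# k(X_i) = sum over features of |phi_gamma(X_i)|^2, then its empirical mean.
k_values = [sum(v * v for v in row) for row in features]
mean_k = sum(k_values) / len(k_values)

print(trace(gram), mean_k)
```

Since the trace of a positive matrix dominates its largest eigenvalue, the value printed here can also serve as a crude choice of $\kappa$ in numerical experiments.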

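We end this section by showing that, under the assumptions we made, a structure of
reproducing kernel Hilbert space emerges naturally. Let us denote by $\mathcal{Y}^{\mathcal{X}}$ the space of
functions from $\mathcal{X}$ to $\mathcal{Y}$.

Proposition 3. The linear operator $\Phi : \ell_2 \to \mathcal{Y}^{\mathcal{X}}$, $\Phi\beta = f_\beta$, is a partial isometry from $\ell_2$
onto the vector-valued reproducing kernel Hilbert space $\mathcal{H}$ on $\mathcal{X}$, with reproducing kernel
$K : \mathcal{X}\times\mathcal{X} \to \mathcal{L}(\mathcal{Y})$

(22) $K(x,t)y = (\Phi_x\Phi_t^*)y = \sum_{\gamma\in\Gamma} \varphi_\gamma(x)\,\langle y, \varphi_\gamma(t)\rangle, \qquad x,t\in\mathcal{X},\ y\in\mathcal{Y},$

the null space of $\Phi$ is

(23) $\ker\Phi = \{\beta\in\ell_2 \mid \sum_{\gamma\in\Gamma} \varphi_\gamma(x)\beta_\gamma = 0 \ \ \forall x\in\mathcal{X}\},$

and the family $(\varphi_\gamma)_{\gamma\in\Gamma}$ is a normalized tight frame in $\mathcal{H}$, namely

$$\sum_{\gamma\in\Gamma} |\langle f, \varphi_\gamma\rangle_{\mathcal{H}}|^2 = \|f\|_{\mathcal{H}}^2 \qquad \forall f\in\mathcal{H}.$$

Conversely, let $\mathcal{H}$ be a vector-valued reproducing kernel Hilbert space with reproducing
kernel $K$ such that $K(x,x) : \mathcal{Y}\to\mathcal{Y}$ is a trace-class operator for all $x\in\mathcal{X}$, with trace
bounded by $\kappa$. If $(\varphi_\gamma)_{\gamma\in\Gamma}$ is a normalized tight frame in $\mathcal{H}$, then the bound $\sum_{\gamma\in\Gamma}|\varphi_\gamma(x)|^2 \le \kappa$ holds.

Proof. Proposition 2.4 of [12] (with $\mathcal{K} = \mathcal{Y}$, $H = \ell_2$, $\gamma(x) = \Phi_x^*$ and $A = \Phi$) gives that $\Phi$ is
a partial isometry from $\ell_2$ onto the reproducing kernel Hilbert space $\mathcal{H}$, with reproducing
kernel $K(x,t)$. Eq. (23) is clear. Since $\Phi$ is a partial isometry with range $\mathcal{H}$ and $\Phi e_\gamma = \varphi_\gamma$,
where $(e_\gamma)_{\gamma\in\Gamma}$ is a basis in $\ell_2$, then $(\varphi_\gamma)_{\gamma\in\Gamma}$ is a normalized tight frame in $\mathcal{H}$.
To show the converse result, given $x\in\mathcal{X}$ and $y\in\mathcal{Y}$, we apply the definition of a
normalized tight frame to the function $K_x y$ defined by $(K_x y)(t) = K(t,x)y$. $K_x y$ belongs
to $\mathcal{H}$ by definition of a reproducing kernel Hilbert space and is such that the following
reproducing property holds: $\langle f, K_x y\rangle_{\mathcal{H}} = \langle f(x), y\rangle$ for any $f\in\mathcal{H}$. Then

$$\langle K(x,x)y, y\rangle = \|K_x y\|_{\mathcal{H}}^2 = \sum_{\gamma\in\Gamma} |\langle K_x y, \varphi_\gamma\rangle_{\mathcal{H}}|^2 = \sum_{\gamma\in\Gamma} |\langle y, \varphi_\gamma(x)\rangle|^2,$$

where we used twice the reproducing property. Now, if $(y_i)_{i\in I}$ is a basis in $\mathcal{Y}$ and $x\in\mathcal{X}$,

$$\sum_{\gamma\in\Gamma} |\varphi_\gamma(x)|^2 = \sum_{\gamma\in\Gamma}\sum_{i\in I} |\langle y_i, \varphi_\gamma(x)\rangle|^2 = \sum_{i\in I} \langle K(x,x)y_i, y_i\rangle = \mathrm{Tr}(K(x,x)) \le \kappa.$$

□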



3. Minimization of the elastic-net functional

In this section, we study the properties of the elastic-net estimator $\beta_n^\lambda$ introduced above.
First of all, we characterize the minimizer of the elastic-net functional $\mathcal{E}_n^\lambda$ as the unique
fixed point of a contractive map. Moreover, we characterize some sparsity properties of
the estimator and propose a natural iterative soft-thresholding algorithm to compute it.
Our algorithmic approach is totally different from the method proposed in [35], where
$\beta_n^\lambda$ is computed by first reducing the problem to the case of a pure $\ell_1$ penalty and then
applying the LARS algorithm [20].

In the following we make use of the following vector notation. Given a sample of $n$
i.i.d. observations $(X_1, Y_1), \ldots, (X_n, Y_n)$, and using the operators defined in the previous
section, we can rewrite the elastic-net functional $\mathcal{E}_n^\lambda$ as

(24) $\mathcal{E}_n^\lambda(\beta) = \|\Phi_n\beta - Y\|_n^2 + \lambda\,p_\varepsilon(\beta),$

where $p_\varepsilon(\cdot)$ is the elastic-net penalty defined by (2).



3.1. Fixed point equation. The main difficulty in minimizing (24) is that the functional
is not differentiable because of the presence of the $\ell_1$-term in the penalty. Nonetheless
the convexity of such a term enables us to use tools from subdifferential calculus. Recall
that, if $F : \ell_2 \to \mathbb{R}$ is a convex functional, the subgradient at a point $\beta\in\ell_2$ is the set of
elements $\eta\in\ell_2$ such that

$$F(\beta + \beta') \ge F(\beta) + \langle \eta, \beta'\rangle_2 \qquad \forall\beta'\in\ell_2.$$

The subgradient at $\beta$ is denoted by $\partial F(\beta)$, see [21]. We compute the subgradient of the
convex functional $p_\varepsilon(\beta)$, using the following definition of $\mathrm{sgn}(t)$:

(25) $\mathrm{sgn}(t) = 1$ if $t > 0$; $\quad \mathrm{sgn}(t) \in [-1,1]$ if $t = 0$; $\quad \mathrm{sgn}(t) = -1$ if $t < 0$.

We first state the following lemma.

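Lemma 2. The functional $p_\varepsilon(\cdot)$ is a convex, lower semi-continuous (l.s.c.) functional
from $\ell_2$ into $[0,\infty]$. Given $\beta\in\ell_2$, a vector $\eta \in \partial p_\varepsilon(\beta)$ if and only if

$$\eta_\gamma = w_\gamma\,\mathrm{sgn}(\beta_\gamma) + 2\varepsilon\beta_\gamma \quad \forall\gamma\in\Gamma \qquad\text{and}\qquad \sum_{\gamma\in\Gamma} \eta_\gamma^2 < +\infty.$$

Proof. Define the map $F : \Gamma\times\mathbb{R} \to [0,\infty]$

$$F(\gamma, t) = w_\gamma|t| + \varepsilon t^2.$$

Given $\gamma\in\Gamma$, $F(\gamma,\cdot)$ is a convex, continuous function and its subgradient is

$$\partial F(\gamma, t) = \{r\in\mathbb{R} \mid r = w_\gamma\,\mathrm{sgn}(t) + 2\varepsilon t\},$$

where we used the fact that the subgradient of $|t|$ is given by $\mathrm{sgn}(t)$. Since

$$p_\varepsilon(\beta) = \sum_{\gamma\in\Gamma} F(\gamma,\beta_\gamma) = \sup_{\Gamma'\ \text{finite}}\ \sum_{\gamma\in\Gamma'} F(\gamma,\beta_\gamma)$$

and $\beta\mapsto\beta_\gamma$ is continuous, a standard result of convex analysis [21] ensures that $p_\varepsilon(\cdot)$ is
convex and lower semi-continuous.

The computation of the subgradient is standard. Given $\beta\in\ell_2$ and $\eta\in\partial p_\varepsilon(\beta)\subset\ell_2$, by
the definition of a subgradient,

$$\sum_{\gamma\in\Gamma} F(\gamma, \beta_\gamma + \beta'_\gamma) \ge \sum_{\gamma\in\Gamma} F(\gamma,\beta_\gamma) + \sum_{\gamma\in\Gamma} \eta_\gamma\beta'_\gamma \qquad \forall\beta'\in\ell_2.$$

Given $\gamma\in\Gamma$, choosing $\beta' = te_\gamma$ with $t\in\mathbb{R}$, it follows that $\eta_\gamma$ belongs to the subgradient of
$F(\gamma,\beta_\gamma)$, that is,

(26) $\eta_\gamma = w_\gamma\,\mathrm{sgn}(\beta_\gamma) + 2\varepsilon\beta_\gamma.$

Conversely, if (26) holds for all $\gamma\in\Gamma$, by definition of a subgradient

$$F(\gamma, \beta_\gamma + \beta'_\gamma) \ge F(\gamma,\beta_\gamma) + \eta_\gamma\beta'_\gamma.$$

By summing over $\gamma\in\Gamma$ and taking into account the fact that $(\eta_\gamma\beta'_\gamma)_{\gamma\in\Gamma}\in\ell_1$, we get

$$p_\varepsilon(\beta+\beta') \ge p_\varepsilon(\beta) + \langle\eta,\beta'\rangle_2.$$

□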



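To state our main result about the characterization of the minimizer of (24), we need
to introduce the soft-thresholding function $S_\lambda : \mathbb{R}\to\mathbb{R}$, $\lambda > 0$, which is defined by

(27) $S_\lambda(t) = t - \frac{\lambda}{2}$ if $t > \frac{\lambda}{2}$; $\quad S_\lambda(t) = 0$ if $|t| \le \frac{\lambda}{2}$; $\quad S_\lambda(t) = t + \frac{\lambda}{2}$ if $t < -\frac{\lambda}{2}$,

and the corresponding nonlinear thresholding operator $\mathbf{S}_\lambda : \ell_2\to\ell_2$ acting componentwise
as

(28) $[\mathbf{S}_\lambda(\beta)]_\gamma = S_{\lambda w_\gamma}(\beta_\gamma).$

We note that the soft-thresholding operator satisfies

(29) $\mathbf{S}_{a\lambda}(a\beta) = a\,\mathbf{S}_\lambda(\beta), \qquad a > 0,\ \beta\in\ell_2,$

(30) $\|\mathbf{S}_\lambda(\beta) - \mathbf{S}_\lambda(\beta')\|_2 \le \|\beta - \beta'\|_2, \qquad \beta,\beta'\in\ell_2.$

These properties are immediate consequences of the fact that

$$S_{a\lambda}(at) = a\,S_\lambda(t), \qquad a > 0,\ t\in\mathbb{R},$$

$$|S_\lambda(t) - S_\lambda(t')| \le |t - t'|, \qquad t,t'\in\mathbb{R}.$$

Notice that (30) with $\beta' = 0$ ensures that $\mathbf{S}_\lambda(\beta)\in\ell_2$ for all $\beta\in\ell_2$.
We are ready to prove the following theorem.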
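For concreteness, the definitions (27) and (28) and the scalar properties behind (29) and (30) can be sketched in a few lines of Python. This is only an illustrative sketch for a finite index set; the function names are hypothetical and do not appear in the paper.

```python
def soft_threshold(t, lam):
    """Scalar soft-thresholding S_lam(t) of (27); the threshold is lam / 2."""
    half = lam / 2.0
    if t > half:
        return t - half
    if t < -half:
        return t + half
    return 0.0


def soft_threshold_vec(beta, lam, weights):
    """Componentwise operator of (28): entry gamma is thresholded at lam * w_gamma / 2."""
    return [soft_threshold(b, lam * w) for b, w in zip(beta, weights)]


# Numerical check of the scaling property and of non-expansiveness on a small grid.
grid = [-3.0, -1.0, -0.25, 0.0, 0.4, 1.0, 2.5]
lam, a = 1.0, 3.0
scaling_ok = all(
    abs(soft_threshold(a * t, a * lam) - a * soft_threshold(t, lam)) < 1e-12
    for t in grid
)
nonexpansive_ok = all(
    abs(soft_threshold(s, lam) - soft_threshold(t, lam)) <= abs(s - t) + 1e-12
    for s in grid for t in grid
)
print(scaling_ok, nonexpansive_ok)
```

Note that the threshold is $\lambda/2$ rather than $\lambda$, which reflects the absence of a factor $1/2$ in front of the data term of (24).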

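Theorem 1. Given $\varepsilon \ge 0$ and $\lambda > 0$, a vector $\beta\in\ell_2$ is a minimizer of the elastic-net
functional $\mathcal{E}_n^\lambda$ if and only if it solves the nonlinear equation

(31) $\frac{1}{n}\sum_{i=1}^n \langle Y_i - (\Phi_n\beta)(X_i), \varphi_\gamma(X_i)\rangle - \varepsilon\lambda\beta_\gamma = \frac{\lambda}{2}\,w_\gamma\,\mathrm{sgn}(\beta_\gamma) \qquad \forall\gamma\in\Gamma,$

or, equivalently,

(32) $\beta = \mathbf{S}_\lambda\big((1 - \varepsilon\lambda)\beta + \Phi_n^*(Y - \Phi_n\beta)\big).$

If $\varepsilon > 0$ the solution always exists and is unique. If $\varepsilon = 0$, $\kappa_0 > 0$ and $w_0 = \inf_{\gamma\in\Gamma} w_\gamma > 0$,
the solution still exists and is unique.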

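Proof. If $\varepsilon > 0$ the functional $\mathcal{E}_n^\lambda$ is strictly convex, finite at 0, and it is coercive by

$$\mathcal{E}_n^\lambda(\beta) \ge \lambda\,p_\varepsilon(\beta) \ge \lambda\varepsilon\,\|\beta\|_2^2.$$

Observing that $\|\Phi_n\beta - Y\|_n^2$ is continuous and, by Lemma 2, the elastic-net penalty is
l.s.c., then $\mathcal{E}_n^\lambda$ is l.s.c. and, since $\ell_2$ is reflexive, there is a unique minimizer $\beta_n^\lambda$ in $\ell_2$. If
$\varepsilon = 0$, $\mathcal{E}_n^\lambda$ is convex, but the fact that $\kappa_0 > 0$ ensures that the minimizer is unique. Its
existence follows from the observation that

$$\mathcal{E}_n^\lambda(\beta) \ge \lambda\,p_\varepsilon(\beta) = \lambda\sum_{\gamma\in\Gamma} w_\gamma|\beta_\gamma| \ge \lambda w_0\,\|\beta\|_2,$$

where we used (11). In both cases the convexity of $\mathcal{E}_n^\lambda$ implies that $\beta$ is a minimizer if and
only if $0\in\partial\mathcal{E}_n^\lambda(\beta)$. Since $\|\Phi_n\beta - Y\|_n^2$ is continuous, Corollary III.2.1 of [21] ensures that
the subgradient is linear. Observing that $\|\Phi_n\beta - Y\|_n^2$ is differentiable with derivative
$2\Phi_n^*\Phi_n\beta - 2\Phi_n^*Y$, we get

$$\partial\mathcal{E}_n^\lambda(\beta) = 2\Phi_n^*\Phi_n\beta - 2\Phi_n^*Y + \lambda\,\partial p_\varepsilon(\beta).$$

Eq. (31) follows taking into account the explicit form of $\partial p_\varepsilon(\beta)$, $\Phi_n^*\Phi_n\beta$ and $\Phi_n^*Y$, given
by Lemma 2 and Proposition 2 respectively.

We now prove (32), which is equivalent to the set of equations

(33) $\beta_\gamma = S_{\lambda w_\gamma}\Big((1 - \varepsilon\lambda)\beta_\gamma + \frac{1}{n}\sum_{i=1}^n \langle Y_i - (\Phi_n\beta)(X_i), \varphi_\gamma(X_i)\rangle\Big) \qquad \forall\gamma\in\Gamma.$

Setting $\beta'_\gamma = \langle Y - \Phi_n\beta, \varphi_\gamma(X)\rangle_n - \varepsilon\lambda\beta_\gamma$, we have $\beta_\gamma = S_{\lambda w_\gamma}(\beta_\gamma + \beta'_\gamma)$ if and only if

$\beta_\gamma = \beta_\gamma + \beta'_\gamma - \frac{\lambda w_\gamma}{2}$ if $\beta_\gamma + \beta'_\gamma > \frac{\lambda w_\gamma}{2}$; $\quad \beta_\gamma = 0$ if $|\beta_\gamma + \beta'_\gamma| \le \frac{\lambda w_\gamma}{2}$; $\quad \beta_\gamma = \beta_\gamma + \beta'_\gamma + \frac{\lambda w_\gamma}{2}$ if $\beta_\gamma + \beta'_\gamma < -\frac{\lambda w_\gamma}{2}$,

that is,

$|\beta'_\gamma| \le \frac{\lambda w_\gamma}{2}$ if $\beta_\gamma = 0$, or else $\beta'_\gamma = \frac{\lambda w_\gamma}{2}\,\mathrm{sgn}(\beta_\gamma)$,

which is equivalent to (31). □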

The following corollary gives some more information about the characterization of the
solution as the fixed point of a contractive map. In particular, it provides an explicit
expression for the Lipschitz constant of this map and it shows how it depends on the
spectral properties of the empirical mean of the Gram matrix and on the regularization
parameter $\lambda$.

Corollary 1. Let $\varepsilon \ge 0$ and $\lambda > 0$. Pick any arbitrary $\tau > 0$. Then $\beta$ is a minimizer
of $\mathcal{E}_n^\lambda$ in $\ell_2$ if and only if it is a fixed point of the following Lipschitz map $\mathcal{T}_n : \ell_2\to\ell_2$,
namely

(34) $\beta = \mathcal{T}_n\beta \quad\text{where}\quad \mathcal{T}_n\beta = \frac{1}{\tau + \varepsilon\lambda}\,\mathbf{S}_\lambda\big((\tau I - \Phi_n^*\Phi_n)\beta + \Phi_n^*Y\big).$

With the choice $\tau = \frac{\kappa_0 + \kappa}{2}$, the Lipschitz constant is bounded by

$$q = \frac{\kappa - \kappa_0}{\kappa + \kappa_0 + 2\varepsilon\lambda} \le 1.$$

In particular, with this choice of $\tau$ and if $\varepsilon > 0$ or $\kappa_0 > 0$, $\mathcal{T}_n$ is a contraction.

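Proof. Clearly $\beta$ is a minimizer of $\mathcal{E}_n^\lambda$ if and only if it is a minimizer of $\frac{1}{\tau+\varepsilon\lambda}\mathcal{E}_n^\lambda$, which
means that, in (32), we can replace $\lambda$ with $\frac{\lambda}{\tau+\varepsilon\lambda}$, $\Phi_n$ by $\frac{1}{\sqrt{\tau+\varepsilon\lambda}}\Phi_n$ and $Y$ by $\frac{1}{\sqrt{\tau+\varepsilon\lambda}}Y$. Hence
$\beta$ is a minimizer of $\mathcal{E}_n^\lambda$ if and only if it is a solution of

$$\beta = \mathbf{S}_{\frac{\lambda}{\tau+\varepsilon\lambda}}\Big(\Big(1 - \frac{\varepsilon\lambda}{\tau+\varepsilon\lambda}\Big)\beta + \frac{1}{\tau+\varepsilon\lambda}\,\Phi_n^*(Y - \Phi_n\beta)\Big).$$

Therefore, by (29) with $a = \frac{1}{\tau+\varepsilon\lambda}$, $\beta$ is a minimizer of $\mathcal{E}_n^\lambda$ if and only if $\beta = \mathcal{T}_n\beta$.

We show that $\mathcal{T}_n$ is Lipschitz and calculate explicitly a bound on the Lipschitz constant.
By assumption we have $\kappa_0 I \le \Phi_n^*\Phi_n \le \kappa I$; then, by the Spectral Theorem,

$$\|\tau I - \Phi_n^*\Phi_n\|_{\ell_2\to\ell_2} \le \max\{|\tau - \kappa_0|, |\tau - \kappa|\},$$

where $\|\cdot\|_{\ell_2\to\ell_2}$ denotes the operator norm of a bounded operator on $\ell_2$. Hence, using (30),
we get

$$\|\mathcal{T}_n\beta - \mathcal{T}_n\beta'\|_2 \le \frac{1}{\tau+\varepsilon\lambda}\,\|(\tau I - \Phi_n^*\Phi_n)(\beta - \beta')\|_2 \le \max\Big\{\frac{|\tau-\kappa_0|}{\tau+\varepsilon\lambda}, \frac{|\tau-\kappa|}{\tau+\varepsilon\lambda}\Big\}\,\|\beta-\beta'\|_2 =: q\,\|\beta-\beta'\|_2.$$

The minimum of $q$ with respect to $\tau$ is obtained for

$$\frac{\tau - \kappa_0}{\tau+\varepsilon\lambda} = \frac{\kappa - \tau}{\tau+\varepsilon\lambda},$$

that is, $\tau = \frac{\kappa+\kappa_0}{2}$, and, with this choice, we get

$$q = \frac{\kappa - \kappa_0}{\kappa + \kappa_0 + 2\varepsilon\lambda}.$$

□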




By inspecting the proof, we notice that the choice $\tau = \frac{\kappa_0+\kappa}{2}$ provides the best possible
Lipschitz constant under the assumption that $\kappa_0 I \le \Phi_n^*\Phi_n \le \kappa I$. If $\varepsilon > 0$ or $\kappa_0 > 0$, $\mathcal{T}_n$
is a contraction and $\beta_n^\lambda$ can be computed by means of the Banach fixed point theorem.
If $\varepsilon = 0$ and $\kappa_0 = 0$, $\mathcal{T}_n$ is only non-expansive, so that proving the convergence of the
successive approximation scheme is not straightforward.

Let us now write down explicitly the iterative procedure suggested by Corollary 1 to
compute $\beta_n^\lambda$. Define the iterative scheme by

$$\beta^0 = 0, \qquad \beta^\ell = \frac{1}{\tau+\varepsilon\lambda}\,\mathbf{S}_\lambda\big((\tau I - \Phi_n^*\Phi_n)\beta^{\ell-1} + \Phi_n^*Y\big)$$

with $\tau = \frac{\kappa_0+\kappa}{2}$. The following corollary shows that $\beta^\ell$ converges to $\beta_n^\lambda$ when $\ell$ goes to
infinity.



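Corollary 2. Assume that $\varepsilon > 0$ or $\kappa_0 > 0$. For any $\ell\in\mathbb{N}$ the following inequality holds:

(35) $\|\beta^\ell - \beta_n^\lambda\|_2 \le \frac{(\kappa-\kappa_0)^\ell}{(\kappa+\kappa_0+2\varepsilon\lambda)^\ell\,(\kappa_0+\varepsilon\lambda)}\,\|\Phi_n^*Y\|_2.$

In particular, $\lim_{\ell\to\infty} \|\beta^\ell - \beta_n^\lambda\|_2 = 0$.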

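Proof. Since $\mathcal{T}_n$ is a contraction with Lipschitz constant $q = \frac{\kappa-\kappa_0}{\kappa+\kappa_0+2\varepsilon\lambda} < 1$, the Banach
fixed point theorem applies and the sequence $(\beta^\ell)_{\ell\in\mathbb{N}}$ converges to the unique fixed point
of $\mathcal{T}_n$, which is $\beta_n^\lambda$ by Corollary 1. Moreover we can use the Lipschitz property of $\mathcal{T}_n$ to
write

$$\|\beta^\ell - \beta_n^\lambda\|_2 \le \|\beta^\ell - \beta^{\ell+1}\|_2 + \|\beta^{\ell+1} - \beta_n^\lambda\|_2 \le q\,\|\beta^{\ell-1} - \beta^\ell\|_2 + q\,\|\beta^\ell - \beta_n^\lambda\|_2 \le q^\ell\,\|\beta^0 - \beta^1\|_2 + q\,\|\beta^\ell - \beta_n^\lambda\|_2,$$

so that we immediately get

$$\|\beta^\ell - \beta_n^\lambda\|_2 \le \frac{q^\ell}{1-q}\,\|\beta^1 - \beta^0\|_2 \le \frac{(\kappa-\kappa_0)^\ell}{(\kappa+\kappa_0+2\varepsilon\lambda)^\ell\,(\kappa_0+\varepsilon\lambda)}\,\|\Phi_n^*Y\|_2,$$

since $\beta^0 = 0$, $\beta^1 = \frac{1}{\tau+\varepsilon\lambda}\mathbf{S}_\lambda(\Phi_n^*Y)$ and $1 - q = \frac{2(\kappa_0+\varepsilon\lambda)}{\kappa+\kappa_0+2\varepsilon\lambda}$. □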
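The iterative scheme of Corollaries 1 and 2 can be sketched as follows for a finite dictionary with scalar outputs. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the trace of the empirical Gram matrix is used as a crude value of $\kappa$ (it dominates the largest eigenvalue), and we assume $\kappa_0 = 0$ and $\varepsilon > 0$ so that the map is a contraction.

```python
def soft_threshold(t, lam):
    """Scalar soft-thresholding (27); the threshold is lam / 2."""
    half = lam / 2.0
    if t > half:
        return t - half
    if t < -half:
        return t + half
    return 0.0


def elastic_net_iterations(features, y, weights, lam, eps, n_iter=500):
    """Damped iterative soft-thresholding, started from beta^0 = 0.

    features[i][j] is the j-th feature evaluated at X_i, y[i] is Y_i.
    Assumes eps > 0 and kappa_0 = 0, so that tau = kappa / 2.
    """
    n, p = len(y), len(weights)
    # Empirical Gram matrix Phi_n^* Phi_n and vector Phi_n^* Y.
    gram = [[sum(f[j] * f[k] for f in features) / n for k in range(p)]
            for j in range(p)]
    b = [sum(f[j] * yi for f, yi in zip(features, y)) / n for j in range(p)]
    kappa = sum(gram[j][j] for j in range(p))  # trace, an upper spectral bound
    tau = kappa / 2.0
    beta = [0.0] * p
    for _ in range(n_iter):
        v = [tau * beta[j] - sum(gram[j][k] * beta[k] for k in range(p)) + b[j]
             for j in range(p)]
        beta = [soft_threshold(v[j], lam * weights[j]) / (tau + eps * lam)
                for j in range(p)]
    return beta


# One constant feature, Y_i = 2: the minimizer is S_{lam w}(2) / (1 + eps * lam).
beta = elastic_net_iterations([[1.0], [1.0]], [2.0, 2.0], [1.0], lam=1.0, eps=0.5)
print(beta)
```

In the one-dimensional example the functional reduces to $(\beta - 2)^2 + \lambda(|\beta| + \varepsilon\beta^2)$, whose minimizer can be computed by hand and compared with the output of the iteration.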



Let us remark that the bound (35) provides a natural stopping rule for the number of
iterations, namely to select $\ell$ such that $\|\beta^\ell - \beta_n^\lambda\|_2 \le \eta$, where $\eta$ is a bound on the distance
between the estimator $\beta_n^\lambda$ and the true solution. For example, if $\|\Phi_n^*Y\|_2$ is bounded by
$M$ and if $\kappa_0 = 0$, the stopping rule is

$$\ell_{\mathrm{stop}} \ge \frac{\log\frac{M}{\varepsilon\lambda\eta}}{\log\big(1 + \frac{2\varepsilon\lambda}{\kappa}\big)} \qquad\text{so that}\qquad \|\beta^{\ell_{\mathrm{stop}}} - \beta_n^\lambda\|_2 \le \eta.$$
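The stopping rule can be evaluated numerically. The sketch below, with hypothetical function names and arbitrary illustrative values of $M$, $\varepsilon$, $\lambda$, $\kappa$ and $\eta$, computes the smallest admissible $\ell_{\mathrm{stop}}$ for $\kappa_0 = 0$ and compares it with the right-hand side of (35).

```python
import math


def stopping_iteration(M, eps, lam, kappa, eta):
    """Smallest integer l with (kappa / (kappa + 2 eps lam))^l * M / (eps lam) <= eta."""
    ratio = M / (eps * lam * eta)
    if ratio <= 1.0:
        return 0  # the bound (35) already holds at l = 0
    return math.ceil(math.log(ratio) / math.log(1.0 + 2.0 * eps * lam / kappa))


def error_bound(ell, M, eps, lam, kappa):
    """Right-hand side of (35) with kappa_0 = 0 and ||Phi_n^* Y||_2 <= M."""
    q = kappa / (kappa + 2.0 * eps * lam)
    return q ** ell * M / (eps * lam)


# Illustrative values only.
ell_stop = stopping_iteration(M=1.0, eps=1.0, lam=0.1, kappa=1.0, eta=1e-3)
print(ell_stop, error_bound(ell_stop, 1.0, 1.0, 0.1, 1.0))
```

The number of iterations grows like $\kappa/(2\varepsilon\lambda)$ times a logarithmic factor when $\varepsilon\lambda$ is small compared with $\kappa$, which is the regime in which the contraction constant is close to one.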

Finally we notice that all previous results also hold when considering the distribution- 
dependent version of the method. The following proposition summarizes the results in 
this latter case. 



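Footnote: Interestingly, it was proved in [14] using different arguments that the same iterative scheme can still
be used for the case $\varepsilon = 0$ and $\kappa_0 = 0$.

Proposition 4. Let $\varepsilon \ge 0$ and $\lambda > 0$. Pick any arbitrary $\tau > 0$. Then a vector $\beta\in\ell_2$
is a minimizer of

$$\mathcal{E}^\lambda(\beta) = \mathbb{E}\big[|\Phi_P\beta - Y|^2\big] + \lambda\,p_\varepsilon(\beta)$$

if and only if it is a fixed point of the following Lipschitz map, namely

(36) $\beta = \mathcal{T}\beta \quad\text{where}\quad \mathcal{T}\beta = \frac{1}{\tau+\varepsilon\lambda}\,\mathbf{S}_\lambda\big((\tau I - \Phi_P^*\Phi_P)\beta + \Phi_P^*Y\big).$

If $\varepsilon > 0$ or $\kappa_0 > 0$, the minimizer is unique.
If it is unique, we denote it by $\beta^\lambda$:

(37) $\beta^\lambda = \mathrm{argmin}_{\beta\in\ell_2}\big(\mathbb{E}\big[|\Phi_P\beta - Y|^2\big] + \lambda\,p_\varepsilon(\beta)\big).$

We add a comment. Under Assumption 2 and the definition of $\beta^\varepsilon$, the statistical model
is $Y = \Phi_P\beta^\varepsilon + W$ where $W$ has zero mean, so that $\beta^\lambda$ is also the minimizer of

(38) $\inf_{\beta\in\ell_2}\big(\|\Phi_P\beta - \Phi_P\beta^\varepsilon\|_P^2 + \lambda\,p_\varepsilon(\beta)\big).$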

3.2. Sparsity properties. The results of the previous section immediately yield a crude
estimate of the number and localization of the non-zero coefficients of our estimator.
Indeed, although the set of features could be infinite, $\beta_n^\lambda$ has only a finite number of
coefficients different from zero provided that the sequence of weights is bounded away
from zero.

Corollary 3. Assume that the family of weights satisfies $\inf_{\gamma\in\Gamma} w_\gamma > 0$; then for any
$\beta\in\ell_2$, the support of $\mathbf{S}_\lambda(\beta)$ is finite. In particular, $\beta_n^\lambda$, $\beta^\ell$ and $\beta^\lambda$ are all finitely
supported.

Proof. Let $w_0 = \inf_{\gamma\in\Gamma} w_\gamma > 0$. Since $\sum_{\gamma\in\Gamma} |\beta_\gamma|^2 < +\infty$, there is a finite subset $\Gamma_0\subset\Gamma$
such that $|\beta_\gamma| \le \frac{\lambda}{2}w_0 \le \frac{\lambda}{2}w_\gamma$ for all $\gamma\notin\Gamma_0$. This implies that

$$S_{\lambda w_\gamma}(\beta_\gamma) = 0 \qquad\text{for } \gamma\notin\Gamma_0,$$

by the definition of soft-thresholding, so that the support of $\mathbf{S}_\lambda(\beta)$ is contained in $\Gamma_0$.
Equations (32), (36) and the definition of $\beta^\ell$ imply that $\beta_n^\lambda$, $\beta^\lambda$ and $\beta^\ell$ have finite support.

□

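However, the supports of $\beta^\ell$ and $\beta_n^\lambda$ are not known a priori and to compute $\beta^\ell$ one
would need to store the infinite matrix $\Phi_n^*\Phi_n$. The following corollary suggests a strategy
to overcome this problem.

Corollary 4. Given $\varepsilon \ge 0$ and $\lambda > 0$, let

$$\Gamma_\lambda = \Big\{\gamma\in\Gamma \ \Big|\ \|\varphi_\gamma\|_n \ne 0 \ \text{ and } \ w_\gamma \le \frac{2\,\|Y\|_n\,(\|\varphi_\gamma\|_n + \sqrt{\varepsilon\lambda})}{\lambda}\Big\};$$

then

(39) $\mathrm{supp}(\beta_n^\lambda) \subset \Gamma_\lambda.$

Proof. If $\|\varphi_\gamma\|_n = 0$, clearly $\beta_\gamma = 0$ is a solution of (31). Let $M = \|Y\|_n$; the definition of
$\beta_n^\lambda$ as the minimizer of (24) yields the bound $\mathcal{E}_n^\lambda(\beta_n^\lambda) \le \mathcal{E}_n^\lambda(0) = M^2$, so that

$$\|\Phi_n\beta_n^\lambda - Y\|_n \le M, \qquad p_\varepsilon(\beta_n^\lambda) \le \frac{M^2}{\lambda}.$$

Hence, for all $\gamma\in\Gamma$, the second inequality gives that $\varepsilon\lambda(\beta_n^\lambda)_\gamma^2 \le M^2$, and we have

$$\big|\langle Y - \Phi_n\beta_n^\lambda, \varphi_\gamma(X)\rangle_n - \varepsilon\lambda(\beta_n^\lambda)_\gamma\big| \le M\,(\|\varphi_\gamma\|_n + \sqrt{\varepsilon\lambda})$$

and, therefore, by (31)

$$|\mathrm{sgn}((\beta_n^\lambda)_\gamma)| \le \frac{2M\,(\|\varphi_\gamma\|_n + \sqrt{\varepsilon\lambda})}{\lambda w_\gamma}.$$

Since $|\mathrm{sgn}((\beta_n^\lambda)_\gamma)| = 1$ when $(\beta_n^\lambda)_\gamma \ne 0$, this implies that $(\beta_n^\lambda)_\gamma = 0$ if $\frac{2M(\|\varphi_\gamma\|_n + \sqrt{\varepsilon\lambda})}{\lambda w_\gamma} < 1$. □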

Now, let $\Gamma'$ be the set of indexes $\gamma$ such that the corresponding feature $\varphi_\gamma(X_i) \ne 0$ for
some $i = 1,\ldots,n$. If the family of corresponding weights $(w_\gamma)_{\gamma\in\Gamma'}$ goes to infinity, then
$\Gamma_\lambda$ is always finite. Then, since $\mathrm{supp}(\beta_n^\lambda)\subset\Gamma_\lambda$, one can replace $\Gamma$ with $\Gamma_\lambda$ in the definition
of $\Phi_n$ so that $\Phi_n^*\Phi_n$ is a finite matrix and $\Phi_n^*Y$ is a finite vector. In particular the iterative
procedure given by Corollary 1 can be implemented by means of finite matrices.
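The screening rule of Corollary 4 is easy to evaluate on a finite list of candidate features. The following Python sketch, with hypothetical names and toy data, returns the indices that belong to $\Gamma_\lambda$, where $\|\cdot\|_n$ is the empirical norm $\|f\|_n^2 = \frac{1}{n}\sum_{i}|f(X_i)|^2$ and the outputs are assumed to be scalar.

```python
import math


def empirical_norm(values):
    """Empirical L2 norm: sqrt of the mean of the squared values."""
    return math.sqrt(sum(v * v for v in values) / len(values))


def active_set(features, y, weights, lam, eps):
    """Indices gamma with ||phi_gamma||_n != 0 and
    w_gamma <= 2 ||Y||_n (||phi_gamma||_n + sqrt(eps * lam)) / lam."""
    y_norm = empirical_norm(y)
    active = []
    for j, w in enumerate(weights):
        col_norm = empirical_norm([row[j] for row in features])
        if col_norm == 0.0:
            continue  # the feature vanishes on the sample
        if w <= 2.0 * y_norm * (col_norm + math.sqrt(eps * lam)) / lam:
            active.append(j)
    return active


# Toy data: feature 1 vanishes on the sample, feature 2 has a very large weight.
features = [[1.0, 0.0, 1.0], [1.0, 0.0, 1.0]]
y = [1.0, 1.0]
weights = [1.0, 1.0, 100.0]
print(active_set(features, y, weights, lam=1.0, eps=0.0))
```

Only the features returned by such a screening step need to be stored when forming the finite matrix $\Phi_n^*\Phi_n$.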

Finally, by inspecting the proof above one sees that a similar result holds true for the
distribution-dependent minimizer $\beta^\lambda$. Its support is always finite, as already noticed, and
moreover is included in the following set:

$$\Big\{\gamma\in\Gamma \ \Big|\ \|\varphi_\gamma\|_P \ne 0 \ \text{ and } \ w_\gamma \le \frac{2\,\|Y\|_P\,(\|\varphi_\gamma\|_P + \sqrt{\varepsilon\lambda})}{\lambda}\Big\}.$$



4. Probabilistic error estimates

In this section we provide an error analysis for the elastic-net regularization scheme.
Our primary goal is the variable selection problem, so that we need to control the error
$\|\beta_n^{\lambda_n} - \beta\|_2$, where $\lambda_n$ is a suitable choice of the regularization parameter as a function of
the data, and $\beta$ is an explanatory vector encoding the features that are relevant to reconstruct
the regression function $f^*$, that is, such that $f^* = \Phi_P\beta$. Although Assumption 2
implies that the above equation has at least a solution $\beta^*$ with $p_\varepsilon(\beta^*) < \infty$, nonetheless,
the operator $\Phi_P$ is injective only if $(\varphi_\gamma(X))_{\gamma\in\Gamma}$ is $\ell_2$-linearly independent in $L^2_{\mathcal{Y}}(P)$. As
usually done for inverse problems, to restore uniqueness we choose, among all the vectors
$\beta$ such that $f^* = \Phi_P\beta$, the vector $\beta^\varepsilon$ which is the minimizer of the elastic-net penalty.
The vector $\beta^\varepsilon$ can be regarded as the best representation of the regression function $f^*$
according to the elastic-net penalty and we call it the elastic-net representation. Clearly
this representation will depend on $\varepsilon$.

Next we focus on the following error decomposition (for any fixed positive $\lambda$),

(40) $\|\beta_n^\lambda - \beta^\varepsilon\|_2 \le \|\beta_n^\lambda - \beta^\lambda\|_2 + \|\beta^\lambda - \beta^\varepsilon\|_2,$

where $\beta^\lambda$ is given by (37). The first error term in the right-hand side of the above
inequality is due to finite sampling and will be referred to as the sample error, whereas
the second error term is deterministic and is called the approximation error. In Section 4.2
we analyze the sample error via concentration inequalities and we consider the behavior
of the approximation error as a function of the regularization parameter $\lambda$. The analysis
of these error terms leads us to discuss the choice of $\lambda$ and to derive statistical consistency
results for elastic-net regularization. In Section 4.3 we discuss a priori and a posteriori
(adaptive) parameter choices.



Footnote: The sequence $(w_\gamma)_{\gamma\in\Gamma'}$ goes to infinity if for all $M > 0$ there exists a finite set $\Gamma_M$ such that $|w_\gamma| > M$
for all $\gamma\notin\Gamma_M$.

4.1. Identifiability condition and elastic-net representation. The following proposition
provides a way to define a unique solution of the equation $f^* = \Phi_P\beta$. Let

$$\mathcal{B} = \{\beta\in\ell_2 \mid \Phi_P\beta = f^*(X)\} = \beta^* + \ker\Phi_P,$$

where $\beta^*\in\ell_2$ is given by the corresponding condition in Assumption 2 and

$$\ker\Phi_P = \{\beta\in\ell_2 \mid \Phi_P\beta = 0\} = \{\beta\in\ell_2 \mid f_\beta(X) = 0 \text{ with probability } 1\}.$$

Proposition 5. If $\varepsilon > 0$ or $\kappa_0 > 0$, there is a unique $\beta^\varepsilon\in\ell_2$ such that

(41) $p_\varepsilon(\beta^\varepsilon) = \inf_{\beta\in\mathcal{B}} p_\varepsilon(\beta).$

Proof. If $\kappa_0 > 0$, $\mathcal{B}$ reduces to a single point, so that there is nothing to prove. If $\varepsilon > 0$,
$\mathcal{B}$ is a closed subset of a reflexive space. Moreover, by Lemma 2, the penalty $p_\varepsilon(\cdot)$ is
strictly convex, l.s.c. and, by Assumption 2, there exists at least one $\beta^*\in\mathcal{B}$ such
that $p_\varepsilon(\beta^*)$ is finite. Since $p_\varepsilon(\beta) \ge \varepsilon\,\|\beta\|_2^2$, $p_\varepsilon(\cdot)$ is coercive. A standard result of convex
analysis implies that the minimizer exists and is unique. □

4.2. Consistency: sample and approximation errors. The main result of this section
is a probabilistic error estimate for $\|\beta_n^\lambda - \beta^\lambda\|_2$, which will provide a choice $\lambda = \lambda_n$
for the regularization parameter as well as a convergence result for $\|\beta_n^{\lambda_n} - \beta^\varepsilon\|_2$.

We first need to establish two lemmas. The first one shows that the sample error can
be studied in terms of the following quantities

(42) $\|\Phi_n^*\Phi_n - \Phi_P^*\Phi_P\|_{HS} \qquad\text{and}\qquad \|\Phi_n^*W\|_2,$

measuring the perturbation due to random sampling and noise (we recall that $\|\cdot\|_{HS}$
denotes the Hilbert-Schmidt norm of a Hilbert-Schmidt operator on $\ell_2$). The second
lemma provides suitable probabilistic estimates for these quantities.

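Lemma 3. Let $\varepsilon \ge 0$ and $\lambda > 0$. If $\varepsilon > 0$ or $\kappa_0 > 0$, then

(43) $\|\beta_n^\lambda - \beta^\lambda\|_2 \le \frac{1}{\kappa_0+\varepsilon\lambda}\Big(\|(\Phi_n^*\Phi_n - \Phi_P^*\Phi_P)(\beta^\lambda - \beta^\varepsilon)\|_2 + \|\Phi_n^*W\|_2\Big).$

Proof. Let $\tau = \frac{\kappa_0+\kappa}{2}$ and recall that $\beta_n^\lambda$ and $\beta^\lambda$ satisfy (34) and (36), respectively. Taking
into account (30) we get

(44) $\|\beta_n^\lambda - \beta^\lambda\|_2 \le \frac{1}{\tau+\varepsilon\lambda}\,\big\|(\tau\beta_n^\lambda - \Phi_n^*\Phi_n\beta_n^\lambda + \Phi_n^*Y) - (\tau\beta^\lambda - \Phi_P^*\Phi_P\beta^\lambda + \Phi_P^*Y)\big\|_2.$

By Assumption 2 and the definition of $\beta^\varepsilon$, $Y = f^*(X) + W$, and $\Phi_P\beta^\varepsilon$ and $\Phi_n\beta^\varepsilon$ both
coincide with the function $f^*$, regarded as an element of $L^2_{\mathcal{Y}}(P)$ and $L^2_{\mathcal{Y}}(P_n)$ respectively.
Moreover by (8) $\Phi_P^*W = 0$, so that

$$\Phi_n^*Y - \Phi_P^*Y = (\Phi_n^*\Phi_n - \Phi_P^*\Phi_P)\beta^\varepsilon + \Phi_n^*W.$$

Moreover

$$(\tau I - \Phi_n^*\Phi_n)\beta_n^\lambda - (\tau I - \Phi_P^*\Phi_P)\beta^\lambda = (\tau I - \Phi_n^*\Phi_n)(\beta_n^\lambda - \beta^\lambda) - (\Phi_n^*\Phi_n - \Phi_P^*\Phi_P)\beta^\lambda.$$

From the assumption on $\Phi_n^*\Phi_n$ and the choice $\tau = \frac{\kappa+\kappa_0}{2}$, we have $\|\tau I - \Phi_n^*\Phi_n\|_{\ell_2\to\ell_2} \le \frac{\kappa-\kappa_0}{2}$,
so that (44) gives

$$(\tau+\varepsilon\lambda)\,\|\beta_n^\lambda - \beta^\lambda\|_2 \le \|(\Phi_n^*\Phi_n - \Phi_P^*\Phi_P)(\beta^\lambda - \beta^\varepsilon)\|_2 + \|\Phi_n^*W\|_2 + \frac{\kappa-\kappa_0}{2}\,\|\beta_n^\lambda - \beta^\lambda\|_2.$$

The bound (43) is established by observing that $\tau + \varepsilon\lambda - (\kappa-\kappa_0)/2 = \kappa_0 + \varepsilon\lambda$. □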
The probabilistic estimates for (42) are straightforward consequences of the law of
large numbers for vector-valued random variables. More precisely, we recall the following
probabilistic inequalities based on a result of [32, 33]; see also Th. 3.3.4 of [13] and [31]
for concentration inequalities for Hilbert-space-valued random variables.

Proposition 6. Let $(\xi_n)_{n\in\mathbb{N}}$ be a sequence of i.i.d. zero-mean random variables taking
values in a real separable Hilbert space $\mathcal{H}$ and satisfying

(45) $\mathbb{E}\big[\|\xi_i\|_{\mathcal{H}}^m\big] \le \frac{1}{2}\,m!\,M^2 H^{m-2} \qquad \forall m \ge 2,$

where $M$ and $H$ are two positive constants. Then, for all $n\in\mathbb{N}$ and $\eta > 0$,

(46) $\mathbb{P}\Big[\Big\|\frac{1}{n}\sum_{i=1}^n \xi_i\Big\|_{\mathcal{H}} \ge \eta\Big] \le 2\exp\Big(-\frac{n\eta^2}{M^2 + H\eta + M\sqrt{M^2 + 2H\eta}}\Big) = 2\exp\Big(-n\,\frac{M^2}{H^2}\,g\Big(\frac{H\eta}{M^2}\Big)\Big),$

where $g(t) = \frac{t^2}{1 + t + \sqrt{1+2t}}$, or, for all $\delta > 0$,

(47) $\mathbb{P}\Big[\Big\|\frac{1}{n}\sum_{i=1}^n \xi_i\Big\|_{\mathcal{H}} \le \frac{H\delta}{n} + \frac{M\sqrt{2\delta}}{\sqrt{n}}\Big] \ge 1 - 2e^{-\delta}.$



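Proof. Bound (46) is given in [32] with a wrong factor, see [33]. To show (47), observe
that the inverse of the function

$$t \mapsto \frac{t^2}{1 + t + \sqrt{1+2t}}$$

is the function $t \mapsto t + \sqrt{2t}$, so that the equation

$$2\exp\Big(-n\,\frac{M^2}{H^2}\,g\Big(\frac{H\eta}{M^2}\Big)\Big) = 2e^{-\delta}$$

has the solution

$$\eta = \frac{M^2}{H}\Big(\frac{H^2\delta}{nM^2} + \sqrt{\frac{2H^2\delta}{nM^2}}\Big) = \frac{H\delta}{n} + \frac{M\sqrt{2\delta}}{\sqrt{n}}.$$

□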
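The inversion used in the proof of (47) can be checked numerically. In the following Python sketch (function names and parameter values are illustrative assumptions) the deviation level of (47) is inserted into the exponent of (46), which should return exactly $\delta$.

```python
import math


def g(t):
    """The function g(t) = t^2 / (1 + t + sqrt(1 + 2t)) appearing in (46)."""
    return t * t / (1.0 + t + math.sqrt(1.0 + 2.0 * t))


def g_inverse(s):
    """Inverse of g on [0, infinity): s -> s + sqrt(2s)."""
    return s + math.sqrt(2.0 * s)


def deviation_level(H, M, n, delta):
    """Right-hand side of the event in (47): H delta / n + M sqrt(2 delta) / sqrt(n)."""
    return H * delta / n + M * math.sqrt(2.0 * delta) / math.sqrt(n)


def exponent(H, M, n, eta):
    """Exponent n (M^2 / H^2) g(H eta / M^2) of the tail bound (46)."""
    return n * (M * M) / (H * H) * g(H * eta / (M * M))


# Arbitrary illustrative constants.
H, M, n, delta = 2.0, 3.0, 50, 1.7
eta = deviation_level(H, M, n, delta)
print(exponent(H, M, n, eta), delta)
```

The two printed numbers agree up to rounding, which confirms that (47) is simply (46) rewritten in terms of the confidence parameter $\delta$.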



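Lemma 4. With probability greater than $1 - 4e^{-\delta}$, the following inequalities hold, for any
$\lambda > 0$ and $\varepsilon > 0$,

(48) $\|\Phi_n^*W\|_2 \le \frac{\sqrt{\kappa}\,L\,\delta}{n} + \frac{\sqrt{\kappa}\,\sigma\sqrt{2\delta}}{\sqrt{n}} \le \frac{\sqrt{2\kappa\delta}\,(\sigma + L)}{\sqrt{n}} \qquad\text{if } \delta \le n,$

and

(49) $\|\Phi_n^*\Phi_n - \Phi_P^*\Phi_P\|_{HS} \le \frac{\kappa\delta}{n} + \frac{\kappa\sqrt{2\delta}}{\sqrt{n}} \le \frac{3\kappa\sqrt{\delta}}{\sqrt{n}} \qquad\text{if } \delta \le n.$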

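Proof. Consider the $\ell_2$-valued random variable $\Phi_X^*W$. From (8), $\mathbb{E}[\Phi_X^*W] = \mathbb{E}[\mathbb{E}[\Phi_X^*W \mid X]] = 0$
and, for any $m \ge 2$,

$$\mathbb{E}\big[\|\Phi_X^*W\|_2^m\big] = \mathbb{E}\Big[\Big(\sum_{\gamma\in\Gamma} |\langle \varphi_\gamma(X), W\rangle|^2\Big)^{m/2}\Big] \le \kappa^{m/2}\,\mathbb{E}\big[|W|^m\big] \le \kappa^{m/2}\,\frac{m!}{2}\,\sigma^2 L^{m-2},$$

due to (5) and (10). Applying (47) with $H = \sqrt{\kappa}\,L$ and $M = \sqrt{\kappa}\,\sigma$, and recalling the
definition (19), we get that

$$\|\Phi_n^*W\|_2 \le \frac{\sqrt{\kappa}\,L\,\delta}{n} + \frac{\sqrt{\kappa}\,\sigma\sqrt{2\delta}}{\sqrt{n}}$$

with probability greater than $1 - 2e^{-\delta}$.

Consider the random variable $\Phi_X^*\Phi_X$ taking values in the Hilbert space of Hilbert-Schmidt
operators (where $\|\cdot\|_{HS}$ denotes the Hilbert-Schmidt norm). One has that $\mathbb{E}[\Phi_X^*\Phi_X] = \Phi_P^*\Phi_P$
and, by (15), $\|\Phi_X^*\Phi_X - \Phi_P^*\Phi_P\|_{HS} \le 2\kappa$. Hence

$$\mathbb{E}\big[\|\Phi_X^*\Phi_X - \Phi_P^*\Phi_P\|_{HS}^m\big] \le \mathbb{E}\big[\|\Phi_X^*\Phi_X - \Phi_P^*\Phi_P\|_{HS}^2\big]\,(2\kappa)^{m-2} \le \frac{m!}{2}\,\kappa^2\kappa^{m-2},$$

by $m! \ge 2^{m-1}$. Applying (47) with $H = M = \kappa$, we get

$$\|\Phi_n^*\Phi_n - \Phi_P^*\Phi_P\|_{HS} \le \frac{\kappa\delta}{n} + \frac{\kappa\sqrt{2\delta}}{\sqrt{n}}$$

with probability greater than $1 - 2e^{-\delta}$. The simplified bounds are clear provided that
$\delta \le n$. □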



Remark 1. In both (48) and (49), the condition $\delta \le n$ allows us to simplify the bounds,
highlighting the dependence on $n$ and the confidence level $1 - 4e^{-\delta}$. In the following
results we always assume that $\delta \le n$, but we stress the fact that this condition is only
needed to simplify the form of the bounds. Moreover, observe that, for a fixed confidence
level, this requirement on $n$ is very weak: for example, to achieve a 99% confidence level,
we only need to require that $n \ge 6$.
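The arithmetic behind the last statement is that a confidence level $1 - 4e^{-\delta}$ corresponds to $\delta = \log\frac{4}{1 - \text{level}}$, and the condition $\delta \le n$ then fixes the smallest admissible sample size. A short Python sketch, with hypothetical function names:

```python
import math


def delta_for_confidence(level):
    """Solve 1 - 4 exp(-delta) = level for delta."""
    return math.log(4.0 / (1.0 - level))


def minimal_sample_size(level):
    """Smallest integer n satisfying delta <= n at the given confidence level."""
    return math.ceil(delta_for_confidence(level))


print(delta_for_confidence(0.99), minimal_sample_size(0.99))
```

For the 99% level this gives $\delta = \log 400 \approx 5.99$, in agreement with the requirement $n \ge 6$ stated in the remark.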

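The next proposition gives a bound on the sample error. This bound is uniform in the
regularization parameter $\lambda$ in the sense that there exists an event independent of $\lambda$ such
that its probability is greater than $1 - 4e^{-\delta}$ and (50) holds true.

Proposition 7. Assume that $\varepsilon > 0$ or $\kappa_0 > 0$. Let $\delta > 0$ and $n\in\mathbb{N}$ be such that $\delta \le n$; for
any $\lambda > 0$ the bound

(50) $\|\beta_n^\lambda - \beta^\lambda\|_2 \le \frac{c\sqrt{\delta}}{\sqrt{n}\,(\kappa_0+\varepsilon\lambda)}\,\big(1 + \|\beta^\lambda - \beta^\varepsilon\|_2\big)$

holds with probability greater than $1 - 4e^{-\delta}$, where $c = \max\{\sqrt{2\kappa}\,(\sigma + L), 3\kappa\}$.

Proof. Plug bounds (49) and (48) in (43), taking into account that

$$\|(\Phi_n^*\Phi_n - \Phi_P^*\Phi_P)(\beta^\lambda - \beta^\varepsilon)\|_2 \le \|\Phi_n^*\Phi_n - \Phi_P^*\Phi_P\|_{HS}\,\|\beta^\lambda - \beta^\varepsilon\|_2.$$

□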



By inspecting the proof, one sees that the constant $\kappa_0$ in (43) can be replaced by any
constant $\kappa_\lambda$ such that

$$\kappa_0 \le \kappa_\lambda \le \inf_{\beta\in\ell_2,\ \|\beta\|_2=1}\ \Big\|\sum_{\gamma\in\Gamma_\lambda} \beta_\gamma\varphi_\gamma\Big\|_n^2 \qquad\text{with probability 1},$$

where $\Gamma_\lambda$ is the set of active features given by Corollary 4. If $\kappa_0 = 0$ and $\kappa_\lambda > 0$, which
means that $\Gamma_\lambda$ is finite and the active features are linearly independent, one can improve
the bound (52) below. Since we mainly focus on the case of linearly dependent dictionaries
we will not discuss this point any further.

The following proposition shows that the approximation error $\|\beta^\lambda - \beta^\varepsilon\|_2$ tends to zero
when $\lambda$ tends to zero.
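Proposition 8. If $\varepsilon > 0$ then

$$\lim_{\lambda\to 0} \|\beta^\lambda - \beta^\varepsilon\|_2 = 0.$$

Proof. It is enough to prove the result for an arbitrary sequence $(\lambda_j)_{j\in\mathbb{N}}$ converging to
0. Putting $\beta^j = \beta^{\lambda_j}$, since $\|\Phi_P\beta - Y\|_P^2 = \|\Phi_P\beta - f^*(X)\|_P^2 + \|f^*(X) - Y\|_P^2$, by the
definition of $\beta^j$ as the minimizer of (37) and the fact that $\beta^\varepsilon$ solves $\Phi_P\beta = f^*$, we get

$$\|\Phi_P\beta^j - f^*(X)\|_P^2 + \lambda_j\,p_\varepsilon(\beta^j) \le \|\Phi_P\beta^\varepsilon - f^*(X)\|_P^2 + \lambda_j\,p_\varepsilon(\beta^\varepsilon) = \lambda_j\,p_\varepsilon(\beta^\varepsilon).$$

The condition on $\beta^*$ in the standing assumptions ensures that $p_\varepsilon(\beta^\varepsilon)$ is finite, so that

$$\|\Phi_P\beta^j - f^*(X)\|_P^2 \le \lambda_j\,p_\varepsilon(\beta^\varepsilon) \qquad\text{and}\qquad p_\varepsilon(\beta^j) \le p_\varepsilon(\beta^\varepsilon).$$

Since $\varepsilon > 0$, the last inequality implies that $(\beta^j)_{j\in\mathbb{N}}$ is a bounded sequence in $\ell_2$. Hence,
possibly passing to a subsequence, $(\beta^j)_{j\in\mathbb{N}}$ converges weakly to some $\beta_*$. We claim that
$\beta_* = \beta^\varepsilon$. Since $\beta\mapsto\|\Phi_P\beta - f^*(X)\|_P^2$ is l.s.c.,

$$\|\Phi_P\beta_* - f^*(X)\|_P^2 \le \liminf_{j\to\infty} \|\Phi_P\beta^j - f^*(X)\|_P^2 \le \liminf_{j\to\infty} \lambda_j\,p_\varepsilon(\beta^\varepsilon) = 0,$$

that is, $\beta_*\in\mathcal{B}$. Since $p_\varepsilon(\cdot)$ is l.s.c.,

$$p_\varepsilon(\beta_*) \le \liminf_{j\to\infty} p_\varepsilon(\beta^j) \le p_\varepsilon(\beta^\varepsilon).$$

By the definition of $\beta^\varepsilon$, it follows that $\beta_* = \beta^\varepsilon$ and, hence,

(51) $\lim_{j\to\infty} p_\varepsilon(\beta^j) = p_\varepsilon(\beta^\varepsilon).$

To prove that $\beta^j$ converges to $\beta^\varepsilon$ in $\ell_2$, it is enough to show that $\lim_{j\to\infty} \|\beta^j\|_2 = \|\beta^\varepsilon\|_2$.
Since $\|\cdot\|_2$ is l.s.c., $\liminf_{j\to\infty} \|\beta^j\|_2 \ge \|\beta^\varepsilon\|_2$. Hence we are left to prove that
$\limsup_{j\to\infty} \|\beta^j\|_2 \le \|\beta^\varepsilon\|_2$. Assume the contrary. This implies that, possibly passing to a
subsequence,

$$\lim_{j\to\infty} \|\beta^j\|_2 > \|\beta^\varepsilon\|_2$$

and, using (51),

$$\lim_{j\to\infty} \sum_{\gamma\in\Gamma} w_\gamma|\beta^j_\gamma| < \sum_{\gamma\in\Gamma} w_\gamma|\beta^\varepsilon_\gamma|.$$

However, since $\beta\mapsto\sum_{\gamma\in\Gamma} w_\gamma|\beta_\gamma|$ is l.s.c.,

$$\liminf_{j\to\infty} \sum_{\gamma\in\Gamma} w_\gamma|\beta^j_\gamma| \ge \sum_{\gamma\in\Gamma} w_\gamma|\beta^\varepsilon_\gamma|,$$

which is a contradiction. □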



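From (50) and the triangle inequality, we easily deduce that

(52) $\|\beta_n^\lambda - \beta^\varepsilon\|_2 \le \frac{c\sqrt{\delta}}{\sqrt{n}\,(\kappa_0+\varepsilon\lambda)}\,\big(1 + \|\beta^\lambda - \beta^\varepsilon\|_2\big) + \|\beta^\lambda - \beta^\varepsilon\|_2$

with probability greater than $1 - 4e^{-\delta}$. Since the tails are exponential, the above bound and
the Borel-Cantelli lemma imply the following theorem, which states that the estimator
$\beta_n^\lambda$ converges to the generalized solution $\beta^\varepsilon$, for a suitable choice of the regularization
parameter $\lambda$.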
Theorem 2. Assume that $\varepsilon > 0$ and $\kappa_0 = 0$. Let $\lambda_n$ be a choice of $\lambda$ as a function of $n$
such that $\lim_{n\to\infty} \lambda_n = 0$ and $\lim_{n\to\infty} (n\lambda_n^2 - 2\log n) = +\infty$. Then

$$\lim_{n\to\infty} \|\beta_n^{\lambda_n} - \beta^\varepsilon\|_2 = 0 \qquad\text{with probability 1}.$$

If $\kappa_0 > 0$, the above convergence result holds for any choice of $\lambda_n$ such that $\lim_{n\to\infty} \lambda_n = 0$.

Proof. The only nontrivial statement concerns the convergence with probability 1. We
give the proof only for $\kappa_0 = 0$, the other case being similar. Let $(\lambda_n)_{n\ge 1}$ be a sequence
such that $\lim_{n\to\infty} \lambda_n = 0$ and $\lim_{n\to\infty} (n\lambda_n^2 - 2\log n) = +\infty$. Since $\lim_{n\to\infty} \lambda_n = 0$,
Proposition 8 ensures that $\lim_{n\to\infty} \|\beta^{\lambda_n} - \beta^\varepsilon\|_2 = 0$. Hence, it is enough to show that
$\lim_{n\to\infty} \|\beta_n^{\lambda_n} - \beta^{\lambda_n}\|_2 = 0$ with probability 1. Let $D = \sup_{n\ge 1} \varepsilon^{-1}c\,(1 + \|\beta^{\lambda_n} - \beta^\varepsilon\|_2)$,
which is finite since the approximation error goes to zero if $\lambda$ tends to zero. Given $\eta > 0$,
let $\delta = \frac{n\lambda_n^2\eta^2}{D^2} \le n$ for $n$ large enough, so that the bound (50) holds, providing that

$$\mathbb{P}\big[\|\beta_n^{\lambda_n} - \beta^{\lambda_n}\|_2 \ge \eta\big] \le 4e^{-n\lambda_n^2\eta^2/D^2}.$$

The condition that $\lim_{n\to\infty} (n\lambda_n^2 - 2\log n) = +\infty$ implies that the series $\sum_{n=1}^\infty e^{-n\lambda_n^2\eta^2/D^2}$
converges and the Borel-Cantelli lemma gives the thesis. □

Remark 2. The two conditions on $\lambda_n$ in the above theorem are clearly satisfied with the
choice $\lambda_n = (1/n)^r$ with $0 < r < \frac{1}{2}$. Moreover, by inspecting the proof, one can easily
check that to have the convergence of $\beta_n^{\lambda_n}$ to $\beta^\varepsilon$ in probability, it is enough to require that
$\lim_{n\to\infty} \lambda_n = 0$ and $\lim_{n\to\infty} n\lambda_n^2 = +\infty$.
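To see the two conditions at work, the following Python sketch (the function name is hypothetical) evaluates $n\lambda_n^2 - 2\log n$ for the polynomial choice $\lambda_n = n^{-r}$. For $0 < r < 1/2$ the quantity equals $n^{1-2r} - 2\log n$ and diverges, while at the excluded endpoint $r = 1/2$ it equals $1 - 2\log n$ and tends to $-\infty$.

```python
import math


def theorem2_gap(n, r):
    """Value of n * lambda_n^2 - 2 log n for the choice lambda_n = n^(-r)."""
    lam = n ** (-r)
    return n * lam * lam - 2.0 * math.log(n)


for n in (10 ** 2, 10 ** 4, 10 ** 6):
    print(n, theorem2_gap(n, 0.25), theorem2_gap(n, 0.5))
```

Smaller exponents $r$ make the divergence faster, at the price of a slower decay of $\lambda_n$ and hence of the approximation error.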



Let $f_n = f_{\beta_n^{\lambda_n}}$. Since $f^* = f_{\beta^\varepsilon}$ and $\mathbb{E}\big[|f_n(X) - f^*(X)|^2\big] = \|\Phi_P(\beta_n^{\lambda_n} - \beta^\varepsilon)\|_P^2$, the above
theorem implies that

$$\lim_{n\to\infty} \mathbb{E}\big[|f_n(X) - f^*(X)|^2\big] = 0$$

with probability 1, that is, the consistency of the elastic-net regularization scheme with
respect to the square loss.

Let us remark that we are also able to prove such consistency without assuming the sparsity condition in
Assumption 2. To this aim we need the following lemma, which is of interest by itself.

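Lemma 5. Instead of Assumption 2, assume that the regression model is given by

$$Y = f^*(X) + W,$$

where $f^* : \mathcal{X}\to\mathcal{Y}$ is a bounded function and $W$ satisfies (8) and (9). For fixed $\lambda$ and
$\varepsilon > 0$, with probability greater than $1 - 2e^{-\delta}$ we have

(53) $\|\Phi_n^*(f^\lambda - f^*) - \Phi_P^*(f^\lambda - f^*)\|_2 \le \frac{\sqrt{\kappa}\,\delta\,D_\lambda}{n} + \frac{\sqrt{2\kappa\delta}\,\|f^\lambda - f^*\|_P}{\sqrt{n}},$

where $f^\lambda = f_{\beta^\lambda}$ and $D_\lambda = \sup_{x\in\mathcal{X}} |f^\lambda(x) - f^*(x)|$.

We notice that in (53), the function $f^\lambda - f^*$ is regarded both as an element of $L^2_{\mathcal{Y}}(P_n)$
and as an element of $L^2_{\mathcal{Y}}(P)$.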

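Proof. Consider the $\ell_2$-valued random variable

$$Z = \Phi_X^*\big(f^\lambda(X) - f^*(X)\big), \qquad Z_\gamma = \langle f^\lambda(X) - f^*(X), \varphi_\gamma(X)\rangle.$$

A simple computation shows that $\mathbb{E}[Z] = \Phi_P^*(f^\lambda - f^*)$ and

$$\|Z\|_2 \le \sqrt{\kappa}\,|f^\lambda(X) - f^*(X)|.$$

Hence, for any $m \ge 2$,

$$\mathbb{E}\big[\|Z - \mathbb{E}[Z]\|_2^m\big] \le \mathbb{E}\big[\|Z - \mathbb{E}[Z]\|_2^2\big]\,\Big(2\sqrt{\kappa}\,\sup_{x\in\mathcal{X}} |f^\lambda(x) - f^*(x)|\Big)^{m-2} \le \kappa\,\mathbb{E}\big[|f^\lambda(X) - f^*(X)|^2\big]\,\Big(2\sqrt{\kappa}\,\sup_{x\in\mathcal{X}} |f^\lambda(x) - f^*(x)|\Big)^{m-2} \le \frac{m!}{2}\,\kappa\,\|f^\lambda - f^*\|_P^2\,(\sqrt{\kappa}\,D_\lambda)^{m-2}.$$

Applying (47) with $H = \sqrt{\kappa}\,D_\lambda$ and $M = \sqrt{\kappa}\,\|f^\lambda - f^*\|_P$, we obtain the bound (53). □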

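Observe that under Assumption 2 and by the definition of $\beta^\varepsilon$ one has that $D_\lambda \le
\sqrt{\kappa}\,\|\beta^\lambda - \beta^\varepsilon\|_2$, so that (53) becomes

$$\|(\Phi_n^*\Phi_n - \Phi_P^*\Phi_P)(\beta^\lambda - \beta^\varepsilon)\|_2 \le \frac{\kappa\delta\,\|\beta^\lambda - \beta^\varepsilon\|_2}{n} + \frac{\sqrt{2\kappa\delta}\,\|\Phi_P(\beta^\lambda - \beta^\varepsilon)\|_P}{\sqrt{n}}.$$

Since $\Phi_P$ is a compact operator this bound is tighter than the one deduced from (49).
However, the price we pay is that the bound does not hold uniformly in $\lambda$. We are now
able to state the universal strong consistency of the elastic-net regularization scheme.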

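Theorem 3. Assume that $(X, Y)$ satisfy (8) and (9) and that the regression function $f^*$
is bounded. If the linear span of features $(\varphi_\gamma)_{\gamma\in\Gamma}$ is dense in $L^2_{\mathcal{Y}}(P)$ and $\varepsilon > 0$, then

$$\lim_{n\to\infty} \mathbb{E}\big[|f_n(X) - f^*(X)|^2\big] = 0 \qquad\text{with probability 1},$$

provided that $\lim_{n\to\infty} \lambda_n = 0$ and $\lim_{n\to\infty} (n\lambda_n^2 - 2\log n) = +\infty$.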



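Proof. As above we bound separately the approximation error and the sample error. As
for the first term, let $f^\lambda = f_{\beta^\lambda}$. We claim that $\mathbb{E}\big[|f^\lambda(X) - f^*(X)|^2\big]$ goes to zero when $\lambda$
goes to zero. Given $\eta > 0$, the fact that the linear span of the features $(\varphi_\gamma)_{\gamma\in\Gamma}$ is dense
in $L^2_{\mathcal{Y}}(P)$ implies that there is $\beta^\eta\in\ell_2$ such that $p_\varepsilon(\beta^\eta) < \infty$ and

$$\mathbb{E}\big[|f_{\beta^\eta}(X) - Y|^2\big] \le \mathbb{E}\big[|f^*(X) - Y|^2\big] + \eta.$$

Let $\lambda_\eta = \frac{\eta}{1 + p_\varepsilon(\beta^\eta)}$; then, for any $\lambda \le \lambda_\eta$,

$$\mathbb{E}\big[|f^\lambda(X) - f^*(X)|^2\big] \le \Big(\mathbb{E}\big[|f^\lambda(X) - Y|^2\big] - \mathbb{E}\big[|f^*(X) - Y|^2\big]\Big) + \lambda\,p_\varepsilon(\beta^\lambda)$$
$$\le \Big(\mathbb{E}\big[|f_{\beta^\eta}(X) - Y|^2\big] - \mathbb{E}\big[|f^*(X) - Y|^2\big]\Big) + \lambda\,p_\varepsilon(\beta^\eta)$$
$$\le \eta + \eta.$$

As for the sample error, we let $f_n^\lambda = f_{\beta_n^\lambda}$ (so that $f_n = f_n^{\lambda_n}$) and observe that

$$\mathbb{E}\big[|f^\lambda(X) - f_n^\lambda(X)|^2\big] = \|\Phi_P(\beta_n^\lambda - \beta^\lambda)\|_P^2 \le \kappa\,\|\beta_n^\lambda - \beta^\lambda\|_2^2.$$

We bound $\|\beta_n^\lambda - \beta^\lambda\|_2$ by (53), observing that

$$D_\lambda = \sup_{x\in\mathcal{X}} |f^\lambda(x) - f^*(x)| \le \sup_{x\in\mathcal{X}} |f_{\beta^\lambda}(x)| + \sup_{x\in\mathcal{X}} |f^*(x)| \le \sqrt{\kappa}\,\|\beta^\lambda\|_2 + \sup_{x\in\mathcal{X}} |f^*(x)| \le \frac{D}{\sqrt{\lambda}},$$

where $D$ is a suitable constant and where we used the crude estimate

$$\lambda\varepsilon\,\|\beta^\lambda\|_2^2 \le \mathcal{E}^\lambda(\beta^\lambda) \le \mathcal{E}^\lambda(0) = \mathbb{E}\big[|Y|^2\big].$$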
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Hence (53) yields 
(54) 



K{f-n-^*p{f-n\u< 



^6D V2i^\\f\X)-r{X)\ 



Xn 



n 



Observe that the proof of (43) does not depend on the existence of $\beta^\varepsilon$ provided that we
replace both $\Phi_P\beta^\varepsilon \in L^2_Y(P)$ and $\Phi_n\beta^\varepsilon \in L^2_Y(P_n)$ with $f^*$, and we take into account that
both $\Phi_P\beta^\lambda \in L^2_Y(P)$ and $\Phi_n\beta^\lambda \in L^2_Y(P_n)$ are equal to $f^\lambda$. Hence, plugging (54) and (48)
in (43) we have that with probability greater than $1 - 4e^{-\delta}$



Ko + eX \^/n yX- 



n 



n 



where $D$ is a suitable constant and $\delta \le n$. The thesis now follows by combining the bounds
on the sample and approximation errors and repeating the proof of Theorem 2. □

To have an explicit convergence rate, one needs an explicit bound on the approximation
error $\|\beta^\lambda - \beta^\varepsilon\|_2$, for example of the form $\|\beta^\lambda - \beta^\varepsilon\|_2 = O(\lambda^r)$. This is out of the scope
of the paper. We report only the following simple result.

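Proposition 9. Assume that the features $\varphi_\gamma$ are in finite number and linearly independent.
Let $N^* = |\operatorname{supp}(\beta^\varepsilon)|$ and $w^* = \sup_{\gamma\in\operatorname{supp}(\beta^\varepsilon)}\{w_\gamma\}$, then

$$\|\beta^\lambda - \beta^\varepsilon\|_2 \le D N^* \lambda .$$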

With the choice A„ = 4^, for any 6 > and n G N with S < n 

cV6 



(55) 



\K-p^ 



^11 < — 



DN* \ DN* 

+ 



n 



with probability greater than 1 — 4e ^ , where D 
L),3fi;}. 



2ko 



+ e 



and c = iaax{\/2K(a 



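Proof. Observe that the assumption on the set of features is equivalent to assuming that
$\kappa_0 > 0$. First, we bound the approximation error $\|\beta^\lambda - \beta^\varepsilon\|_2$. As usual, with the choice
$\tau = \frac{\kappa_0 + \kappa}{2}$, Eq. (36) gives

$$\beta^\lambda - \beta^\varepsilon = \frac{1}{\tau + \varepsilon\lambda}\left[ S_\lambda\big((\tau I - \Phi_P^*\Phi_P)\beta^\lambda + \Phi_P^*\Phi_P\beta^\varepsilon\big) - S_\lambda(\tau\beta^\varepsilon) + S_\lambda(\tau\beta^\varepsilon) - \tau\beta^\varepsilon \right] - \frac{\varepsilon\lambda}{\tau + \varepsilon\lambda}\,\beta^\varepsilon .$$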



Property (30) implies that 



11/5^-/3' 



£11 < 

2 - T + eX 



(||(r/ - <|.^$p)(/3^ - (3%\^ + ||Sa (r/?^) - r/?^!!^) 



eX 



Since \\tI -<^*p<^p\\ < ^^=^, ||/3^IU < A^*ii'^^i 



T + eX 

2 ' 11/^ 112 



12 ■ 



and 



\Sx{t(3')-t(3'\\^<w*N' 



X 



one has 



1/3^-/3' 



2 — 



2eX 



K + K^ + 2eX 2 A 

w JXI — I 

2{KQ + eX) \K + KQ + 2eX 2 KQ + K + 2eX 



< 



, w 
'2ko 



+ e 



JN*X = DN*X. 



The bound (55) is then a straightforward consequence of (52). □

Let us observe that this bound is weaker than the results obtained in [26] since the constant
$\kappa_0$ is a global property of the dictionary, whereas the constants in [26] are local.
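
The fixed-point equation used in the proof above can be illustrated numerically. The following is a minimal sketch in a finite-dimensional toy setting, not the authors' implementation. It assumes the empirical functional has the form $\frac{1}{n}\|y - \Phi\beta\|^2 + \lambda\sum_\gamma (w_\gamma|\beta_\gamma| + \varepsilon\beta_\gamma^2)$, in which case the soft-thresholding level is $\lambda w_\gamma/2$. The function names, the $2\times 2$ Gram matrix and all numerical values are hypothetical.

```python
def soft_threshold(v, thresh):
    """Componentwise soft-thresholding of the list v with per-entry thresholds."""
    out = []
    for x, t in zip(v, thresh):
        if x > t:
            out.append(x - t)
        elif x < -t:
            out.append(x + t)
        else:
            out.append(0.0)
    return out


def elastic_net_ist(G, b, lam, eps, w, tau, n_iter=10000, tol=1e-13):
    """Iterate beta <- S((tau*I - G) beta + b) / (tau + eps*lam).

    G   : Gram matrix Phi^T Phi / n as a list of rows (assumed symmetric PSD)
    b   : vector Phi^T y / n
    w   : nonnegative weights of the l1 term (threshold assumed to be lam*w/2)
    tau : relaxation parameter, e.g. (kappa_0 + kappa) / 2
    """
    p = len(b)
    thresh = [0.5 * lam * wj for wj in w]
    beta = [0.0] * p
    for _ in range(n_iter):
        # argument of the thresholding map: (tau*I - G) beta + b
        arg = [tau * beta[j] - sum(G[j][k] * beta[k] for k in range(p)) + b[j]
               for j in range(p)]
        new = [x / (tau + eps * lam) for x in soft_threshold(arg, thresh)]
        change = max(abs(x - y) for x, y in zip(new, beta))
        beta = new
        if change <= tol:  # stop once successive iterates agree
            break
    return beta


# Toy example: for a 2x2 Gram matrix, (kappa_0 + kappa) / 2 equals trace(G) / 2.
G = [[2.0, 0.5], [0.5, 1.0]]
b = [1.0, 0.1]
beta = elastic_net_ist(G, b, lam=0.4, eps=0.5, w=[1.0, 1.0], tau=1.5)
print(beta)  # second coefficient is thresholded to zero
```

With $w = 0$ the thresholding is the identity and the iteration converges to the ridge solution of $(G + \varepsilon\lambda I)\beta = b$, which gives a simple sanity check of the sketch.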

4.3. Adaptive choice. In this section, we suggest an adaptive choice of the regularization
parameter $\lambda$. The main advantage of this selection rule is that it does not require any
knowledge of the behavior of the approximation error. To this aim, it is useful to replace
the approximation error with the following upper bound

$$\mathcal{A}(\lambda) = \sup_{0 < \lambda' \le \lambda} \|\beta^{\lambda'} - \beta^\varepsilon\|_2 . \tag{56}$$

The following simple result holds.

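Lemma 6. Given $\varepsilon > 0$, $\mathcal{A}$ is an increasing continuous function and

$$\|\beta^\lambda - \beta^\varepsilon\|_2 \le \mathcal{A}(\lambda) \le \Lambda < \infty, \qquad \lim_{\lambda\to 0^+} \mathcal{A}(\lambda) = 0 .$$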

Proof. First of all, we show that $\lambda \mapsto \beta^\lambda$ is a continuous function. Fix $\lambda > 0$; for any $h$
such that $\lambda + h > 0$, Eq. (36) with $\tau = \frac{\kappa_0 + \kappa}{2}$ and Corollary 1 give

K' — 1^0 W nX+h q\\ 



< 



K + Ko + 2e{\ + h) 

1 
+ 



T + e{\ + h) 



P' 
SA+h (/9') 



P' 



T + e\ 



Sa (/?') 



where $\beta' = (\tau I - \Phi_P^*\Phi_P)\beta^\lambda + \Phi_P^* Y$ does not depend on $h$ and we wrote $T_\lambda$ to make explicit
the dependence of the map $T$ on the regularization parameter. Hence



P'^^" - p' 



< 



r + e{\ + h) 



kq + e{\ + h) 
1 



1 



+ 



Sx+h{P')-Sx{P')h 



+ 



T + e\ 
The claim follows by observing that (assuming for simplicity that $h > 0$)

\\Sx+h{p')-^x{p')\\1= Y. I/?; - sgn(/?;KAp + Y. ^■•''' 

w^X<\P'\<w-y{X+h) |/3;|>i«^(A+/i) 



wt^K' 



<h' Y ^',<h' E {P',l\f<h' 



3/ II 2 



/A^ 



l/3;i>"'7^ 



m>w^x 



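which goes to zero if $h$ tends to zero.

Now, by the definition of $\beta^\lambda$ and $\beta^\varepsilon$,

$$\varepsilon\lambda \|\beta^\lambda\|_2^2 \le \mathbb{E}\left[|\Phi_P\beta^\lambda(X) - f^*(X)|^2\right] + \lambda p_\varepsilon(\beta^\lambda) \le \mathbb{E}\left[|\Phi_P\beta^\varepsilon(X) - f^*(X)|^2\right] + \lambda p_\varepsilon(\beta^\varepsilon) = \lambda p_\varepsilon(\beta^\varepsilon),$$

so that

$$\|\beta^\lambda - \beta^\varepsilon\|_2 \le \|\beta^\varepsilon\|_2 + \sqrt{\frac{p_\varepsilon(\beta^\varepsilon)}{\varepsilon}} =: \Lambda .$$

Hence $\mathcal{A}(\lambda) \le \Lambda$ for all $\lambda$. Clearly $\mathcal{A}(\lambda)$ is an increasing function of $\lambda$; the fact that
$\|\beta^\lambda - \beta^\varepsilon\|_2$ is continuous and goes to zero with $\lambda$ ensures that the same holds true for
$\mathcal{A}(\lambda)$. □

Notice that we replaced the approximation error with $\mathcal{A}(\lambda)$ just for a technical reason,
namely to deal with an increasing function of $\lambda$. If we have a monotonic decay rate at our
disposal, such as $\|\beta^\lambda - \beta^\varepsilon\|_2 \asymp \lambda^a$ for some $a > 0$ and for $\lambda \to 0$, then clearly $\mathcal{A}(\lambda) \asymp \lambda^a$.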

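Now, we fix $\varepsilon > 0$ and $\delta > 2$ and we assume that $\kappa_0 = 0$. Then we simplify the
bound (52) observing that

$$\|\beta_n^\lambda - \beta^\varepsilon\|_2 \le C\left(\frac{1}{\sqrt{n}\,\varepsilon\lambda} + \mathcal{A}(\lambda)\right) \tag{57}$$

where $C = c\sqrt{\delta}\,(1 + \Lambda)$; the bound holds with probability greater than $1 - 4e^{-\delta}$ uniformly
for all $\lambda > 0$.

When $\lambda$ increases, the first term in (57) decreases whereas the second increases; hence to
have a tight bound a natural choice of the parameter consists in balancing the two terms
in the above bound, namely in taking

$$\lambda_n^{\mathrm{opt}} = \sup\left\{\lambda \in\, ]0,\infty[ \ \Big|\ \mathcal{A}(\lambda) = \frac{1}{\sqrt{n}\,\varepsilon\lambda}\right\}.$$

Since $\mathcal{A}(\lambda)$ is continuous, $\frac{1}{\sqrt{n}\,\varepsilon\lambda_n^{\mathrm{opt}}} = \mathcal{A}(\lambda_n^{\mathrm{opt}})$ and the resulting bound is

$$\|\beta_n^{\lambda_n^{\mathrm{opt}}} - \beta^\varepsilon\|_2 \le \frac{2C}{\sqrt{n}\,\varepsilon\lambda_n^{\mathrm{opt}}} . \tag{58}$$

This method for choosing the regularization parameter clearly requires the knowledge of
the approximation error. To overcome this drawback, we discuss a data-driven choice for
$\lambda$ that achieves the rate (58) without requiring any prior information on $\mathcal{A}(\lambda)$.
For this reason, such a choice is said to be adaptive. The procedure we present is also
referred to as an a posteriori choice since it depends on the given sample and not only on
its cardinality $n$. In other words, the method is purely data-driven.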
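
As a worked illustration of the balancing step, suppose (purely as an assumption made for this example) that $\mathcal{A}(\lambda) = \lambda^a$ exactly for some $a > 0$. Equating $\mathcal{A}(\lambda)$ with $\frac{1}{\sqrt{n}\,\varepsilon\lambda}$ and substituting into the balanced bound $\frac{2C}{\sqrt{n}\,\varepsilon\lambda}$ then gives:

```latex
\begin{align*}
\lambda^{a} = \frac{1}{\sqrt{n}\,\varepsilon\lambda}
  &\iff \lambda^{a+1} = \frac{1}{\sqrt{n}\,\varepsilon}
  \iff \lambda_n^{\mathrm{opt}} = (\sqrt{n}\,\varepsilon)^{-\frac{1}{a+1}},\\
\frac{2C}{\sqrt{n}\,\varepsilon\,\lambda_n^{\mathrm{opt}}}
  &= 2C\,(\sqrt{n}\,\varepsilon)^{-1+\frac{1}{a+1}}
  = 2C\,(\sqrt{n}\,\varepsilon)^{-\frac{a}{a+1}} .
\end{align*}
```

For fixed $\varepsilon$ this means $\lambda_n^{\mathrm{opt}} \asymp n^{-\frac{1}{2(a+1)}}$, and the balanced bound decays like $n^{-\frac{a}{2(a+1)}}$.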

Let us consider a discrete set of values for $\lambda$ defined by the geometric sequence

$$\lambda_i = \lambda_0 2^i, \qquad i \in \mathbb{N}, \quad \lambda_0 > 0 .$$

Notice that we may replace the sequence $\lambda_0 2^i$ by any other geometric sequence $\lambda_i = \lambda_0 q^i$
with $q > 1$; this would only lead to a more complicated constant in (60). Define the
parameter $\lambda_n^+$ as follows

$$\lambda_n^+ = \max\left\{\lambda_i \ \Big|\ \|\beta_n^{\lambda_j} - \beta_n^{\lambda_{j-1}}\|_2 \le \frac{4C}{\sqrt{n}\,\varepsilon\lambda_{j-1}} \ \text{ for all } j = 0, \dots, i\right\} \tag{59}$$

(with the convention that $\lambda_{-1} = \lambda_0$). This strategy for choosing $\lambda$ is inspired by a
procedure originally proposed in [27] for Gaussian white noise regression and which has
been widely discussed in the context of deterministic as well as stochastic inverse problems
(see [6, 35]). In the context of nonparametric regression from random design, this strategy
has been considered in [16] and the following proposition is a simple corollary of a result
contained in [16].



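Proposition 10. Provided that $\lambda_0 \le \lambda_n^{\mathrm{opt}}$, the following bound holds with probability
greater than $1 - 4e^{-\delta}$

$$\|\beta_n^{\lambda_n^+} - \beta^\varepsilon\|_2 \le \frac{20\,C}{\sqrt{n}\,\varepsilon\lambda_n^{\mathrm{opt}}} . \tag{60}$$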



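Proof. The proposition results from Theorem 2 in [16]. For completeness, we report here
a proof adapted to our setting. Let $\Omega$ be the event such that (57) holds for any $\lambda > 0$; we
have that $\mathbb{P}[\Omega] \ge 1 - 4e^{-\delta}$ and we fix a sample point in $\Omega$.

The definition of $\lambda_n^{\mathrm{opt}}$ and the assumption $\lambda_0 \le \lambda_n^{\mathrm{opt}}$ ensure that $\mathcal{A}(\lambda_0) \le \frac{1}{\sqrt{n}\,\varepsilon\lambda_0}$. Hence
the set $\{\lambda_i \mid \mathcal{A}(\lambda_i) \le \frac{1}{\sqrt{n}\,\varepsilon\lambda_i}\}$ is not empty and we can define

$$\lambda_* = \max\left\{\lambda_i \ \Big|\ \mathcal{A}(\lambda_i) \le \frac{1}{\sqrt{n}\,\varepsilon\lambda_i}\right\}.$$

The fact that $(\lambda_i)_{i\in\mathbb{N}}$ is a geometric sequence implies that

$$\lambda_* \le \lambda_n^{\mathrm{opt}} < 2\lambda_*, \tag{61}$$

while (57) with the definition of $\lambda_*$ ensures that

$$\|\beta_n^{\lambda_*} - \beta^\varepsilon\|_2 \le C\left(\frac{1}{\sqrt{n}\,\varepsilon\lambda_*} + \mathcal{A}(\lambda_*)\right) \le \frac{2C}{\sqrt{n}\,\varepsilon\lambda_*} . \tag{62}$$

We show that $\lambda_* \le \lambda_n^+$. Indeed, for any $\lambda_j \le \lambda_*$, using (57) twice, we get

$$\begin{aligned}
\|\beta_n^{\lambda_j} - \beta_n^{\lambda_{j-1}}\|_2 &\le \|\beta_n^{\lambda_j} - \beta^\varepsilon\|_2 + \|\beta_n^{\lambda_{j-1}} - \beta^\varepsilon\|_2 \\
&\le C\left(\frac{1}{\sqrt{n}\,\varepsilon\lambda_j} + \mathcal{A}(\lambda_j) + \frac{1}{\sqrt{n}\,\varepsilon\lambda_{j-1}} + \mathcal{A}(\lambda_{j-1})\right) \le \frac{4C}{\sqrt{n}\,\varepsilon\lambda_{j-1}},
\end{aligned}$$

where the last inequality holds since $\lambda_j \le \lambda_* \le \lambda_n^{\mathrm{opt}}$ and $\mathcal{A}(\lambda) \le \frac{1}{\sqrt{n}\,\varepsilon\lambda}$ for all $\lambda \le \lambda_n^{\mathrm{opt}}$.
Now $2^m\lambda_0 = \lambda_* \le \lambda_n^+ = 2^{m+k}\lambda_0$ for some $m, k \in \mathbb{N}$, so that

$$\|\beta_n^{\lambda_n^+} - \beta_n^{\lambda_*}\|_2 \le \sum_{\ell=0}^{k-1} \|\beta_n^{\lambda_{m+1+\ell}} - \beta_n^{\lambda_{m+\ell}}\|_2 \le \sum_{\ell=0}^{k-1} \frac{4C}{\sqrt{n}\,\varepsilon\lambda_{m+\ell}} \le \frac{4C}{\sqrt{n}\,\varepsilon\lambda_*} \sum_{\ell=0}^{\infty} \frac{1}{2^\ell} = \frac{8C}{\sqrt{n}\,\varepsilon\lambda_*} .$$

Finally, recalling (61) and (62), we get the bound (60):

$$\|\beta_n^{\lambda_n^+} - \beta^\varepsilon\|_2 \le \|\beta_n^{\lambda_n^+} - \beta_n^{\lambda_*}\|_2 + \|\beta_n^{\lambda_*} - \beta^\varepsilon\|_2 \le \frac{8C}{\sqrt{n}\,\varepsilon\lambda_*} + \frac{2C}{\sqrt{n}\,\varepsilon\lambda_*} \le \frac{20\,C}{\sqrt{n}\,\varepsilon\lambda_n^{\mathrm{opt}}} .$$

Notice that the a priori condition $\lambda_0 \le \lambda_n^{\mathrm{opt}}$ is satisfied, for example, if $\lambda_0 \le \frac{1}{\Lambda\varepsilon\sqrt{n}}$.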
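To illustrate the implications of the last Proposition, let us suppose that

$$\|\beta^\lambda - \beta^\varepsilon\|_2 \asymp \lambda^a \tag{63}$$

for some unknown $a \in\, ]0, 1]$. One has then that $\lambda_n^{\mathrm{opt}} \asymp n^{-\frac{1}{2(a+1)}}$ and $\|\beta_n^{\lambda_n^+} - \beta^\varepsilon\|_2 \lesssim n^{-\frac{a}{2(a+1)}}$.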
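
The adaptive rule analysed in Proposition 10 can be sketched in a few lines. This is a hedged illustration rather than the authors' procedure: it reads rule (59) as "keep doubling $\lambda$ while $\|\beta_n^{\lambda_j} - \beta_n^{\lambda_{j-1}}\|_2 \le 4C/(\sqrt{n}\,\varepsilon\lambda_{j-1})$", uses a finite grid, and treats the constant $C$ as a user-supplied number, although in the analysis it depends on quantities that are generally unknown. The callable `solve`, the function name `balancing_choice` and the toy estimator in the usage lines are hypothetical.

```python
import math


def balancing_choice(solve, lambdas, n, eps, C):
    """Pick the largest grid value that passes every consecutive-difference test.

    solve   : hypothetical callable, solve(lam) -> list of floats (estimator for lam)
    lambdas : increasing geometric grid lambda_0 < lambda_1 < ...
    n, eps  : sample size and elastic-net parameter epsilon
    C       : constant of the bound, assumed known here for illustration only
    """
    def dist(u, v):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

    prev_lam = lambdas[0]
    prev_beta = solve(prev_lam)
    # j = 0 passes trivially because of the convention lambda_{-1} = lambda_0
    chosen = (prev_lam, prev_beta)
    for lam in lambdas[1:]:
        beta = solve(lam)
        if dist(beta, prev_beta) > 4.0 * C / (math.sqrt(n) * eps * prev_lam):
            break  # first violated test: stop at the previous grid value
        chosen = (lam, beta)
        prev_lam, prev_beta = lam, beta
    return chosen


# Toy usage with a made-up scalar "estimator" that shrinks 1 toward 0.
grid = [0.01 * 2 ** i for i in range(12)]
lam_plus, beta_plus = balancing_choice(lambda lam: [1.0 / (1.0 + lam)],
                                       grid, n=100, eps=1.0, C=0.05)
print(lam_plus, beta_plus)
```

The loop stops at the first violated test because (59) requires the inequality for all $j \le i$, so later grid values cannot qualify even if their own test happens to pass.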



We end noting that, if we specialize our analysis to least squares regularized with a
pure $\ell_2$-penalty (i.e. setting $w_\gamma = 0$, $\forall \gamma \in \Gamma$), then our results lead to the error estimate
in the norm of the reproducing kernel space $\mathcal{H}$ obtained in [36, 7]. Indeed, in such a
case, $\beta^\varepsilon$ is the generalized solution $\beta^\dagger$ of the equation $\Phi_P\beta = f^*$ and the approximation
error satisfies (63) under the a priori assumption that the regression vector $\beta^\dagger$ is in the
range of $(\Phi_P^*\Phi_P)^a$ for some $0 < a \le 1$ (the fractional power makes sense since $\Phi_P^*\Phi_P$ is
a positive operator). Under this assumption, it follows that $\|\beta_n^{\lambda_n^+} - \beta^\dagger\|_2 \lesssim n^{-\frac{a}{2(a+1)}}$. To
compare this bound with the results in the literature, recall that both $f_n = f_{\beta_n^{\lambda_n^+}}$ and
$f^* = f_{\beta^\dagger}$ belong to the reproducing kernel Hilbert space $\mathcal{H}$ defined in Proposition 3. In
particular, one can check that $\beta^\dagger \in \operatorname{ran}(\Phi_P^*\Phi_P)^a$ if and only if $f^* \in \operatorname{ran} L_K^{\frac{2a+1}{2}}$, where
$L_K : L^2_Y(P) \to L^2_Y(P)$ is the integral operator whose kernel is the reproducing kernel $K$
[10]. Under this condition, the following bound holds

$$\|f_n - f^*\|_{\mathcal{H}} \lesssim n^{-\frac{a}{2(a+1)}},$$

which gives the same rate as in Theorem 2 of [36] and Corollary 17 of [7].